I HISTORY OF BIOINFORMATICS
Bioinformatics is an interdisciplinary field that develops methods and software tools
for understanding biological data. It combines computer science, statistics, mathematics,
and engineering to analyze and interpret biological data, and it is used for in silico analyses
of biological questions using mathematical and statistical techniques. Bioinformatics derives
knowledge from computer analysis of biological data, which can consist of the information
stored in the genetic code as well as experimental results from various sources, patient
statistics, and the scientific literature. Research in bioinformatics includes method development
for the storage, retrieval, and analysis of such data. Bioinformatics is a rapidly developing
branch of biology and is highly interdisciplinary, using techniques and concepts from
informatics, statistics, mathematics, chemistry, biochemistry, physics, and linguistics. It has
many practical applications in different areas of biology and medicine.
Bioinformatics: Research, development, or application of computational tools and approaches
for expanding the use of biological, medical, behavioral or health data, including those to
acquire, store, organize, archive, analyze, or visualize such data.
Computational Biology: The development and application of data-analytical and theoretical
methods, mathematical modeling and computational simulation techniques to the study of
biological, behavioral, and social systems.
"Classical" bioinformatics: "The mathematical, statistical and computing methods that aim to
solve biological problems using DNA and amino acid sequences and related information.”
The National Center for Biotechnology Information (NCBI, 2001) defines
bioinformatics as: "Bioinformatics is the field of science in which biology, computer science,
and information technology merge into a single discipline. There are three important
sub-disciplines within bioinformatics: the development of new algorithms and statistics with which
to assess relationships among members of large data sets; the analysis and interpretation of
various types of data including nucleotide and amino acid sequences, protein domains, and
protein structures; and the development and implementation of tools that enable efficient access
and management of different types of information."
Even though the three terms bioinformatics, computational biology and bioinformation
infrastructure are often used interchangeably, broadly, the three may be defined as
follows:
1. bioinformatics refers to database-like activities, involving persistent sets of data that are
maintained in a consistent state over essentially indefinite periods of time;
2. computational biology encompasses the use of algorithmic tools to facilitate biological
analyses; while
3. bioinformation infrastructure comprises the entire collective of information management
systems, analysis tools and communication networks supporting biology. Thus, the latter may
be viewed as a computational scaffold of the former two.
There are three important sub-disciplines within bioinformatics:
• the development of new algorithms and statistics with which to assess
relationships among members of large data sets;
• the analysis and interpretation of various types of data including nucleotide
and amino acid sequences, protein domains, and protein structures;
• and the development and implementation of tools that enable efficient access
and management of different types of information.
Bioinformatics definition - other sources
• Bioinformatics or computational biology is the use of mathematical and informational
techniques, including statistics, to solve biological problems, usually by creating or
using computer programs, mathematical models or both. One of the main areas of
bioinformatics is the data mining and analysis of the data gathered by the various
genome projects. Other areas are sequence alignment, protein structure prediction,
systems biology, protein-protein interactions and virtual evolution. (source:
www.answers.com)
• Bioinformatics is the science of developing computer databases and algorithms for the
purpose of speeding up and enhancing biological research. (source: www.whatis.com)
• "Biologists using computers, or the other way around. Bioinformatics is more of a tool
4
than a discipline.(source: An Understandable Definition of Bioinformatics , The
O'Reilly Bioinformatics Technology Conference, 2003) (4)
• The application of computer technology to the management of biological information.
Specifically, it is the science of developing computer databases and algorithms to
facilitate and expedite biological research.(source: Webopedia)
• Bioinformatics: a combination of Computer Science, Information Technology and
Genetics to determine and analyze genetic information. (Definition from
BitsJournal.com)
• Bioinformatics is the application of computer technology to the management and
analysis of biological data. The result is that computers are being used to gather, store,
analyse and merge biological data.(EBI - 2can resource)
• Bioinformatics is concerned with the creation and development of advanced
information and computational technologies to solve problems in biology.
• Bioinformatics uses techniques from informatics, statistics, molecular biology and
high-performance computing to obtain information about genomic or protein
sequence data.
Bioinformaticist versus Bioinformatician
A bioinformaticist is an expert who not only knows how to use bioinformatics tools,
but also knows how to write interfaces for effective use of the tools.
A bioinformatician, on the other hand, is a trained individual who only knows how to use
bioinformatics tools without a deeper understanding.
Aims of Bioinformatics
In general, the aims of bioinformatics are three-fold.
1. The first aim of bioinformatics is to store biological data in an organized form, i.e., a
database. This gives researchers easy access to existing information and lets them submit
new entries. The data must be annotated to give them a suitable meaning or to assign
functional characteristics, and the databases must be able to correlate different
hierarchies of information. Examples include GenBank for nucleotide and protein
sequence information and the Protein Data Bank for 3D macromolecular structures.
2. The second aim is to develop tools and resources that aid in the analysis of data.
Examples include BLAST for finding similar nucleotide or amino-acid sequences,
ClustalW for aligning two or more nucleotide or amino-acid sequences, and Primer3
for designing primers and probes for PCR techniques.
3. The third and most important aim of bioinformatics is to use these computational
tools to analyze the biological data and to interpret the results in a biologically meaningful
manner.
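To make the idea of sequence comparison concrete, here is a minimal, self-contained sketch of global pairwise alignment scoring in the Needleman-Wunsch style, the kind of comparison that tools such as BLAST and ClustalW perform on a much larger scale with far more sophisticated scoring schemes and heuristics. The example sequences and score values are illustrative assumptions only, not the defaults of any named tool.

```python
# Minimal sketch of global pairwise alignment scoring (Needleman-Wunsch style).
# All scoring values below are illustrative assumptions, not any tool's defaults.

def global_alignment_score(seq1, seq2, match=1, mismatch=-1, gap=-2):
    """Fill the dynamic-programming matrix and return the optimal alignment score."""
    rows, cols = len(seq1) + 1, len(seq2) + 1
    score = [[0] * cols for _ in range(rows)]

    # Initialise the first row and column with cumulative gap penalties.
    for i in range(rows):
        score[i][0] = i * gap
    for j in range(cols):
        score[0][j] = j * gap

    # Each cell takes the best of a diagonal (match/mismatch) step or a gap step.
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if seq1[i - 1] == seq2[j - 1] else mismatch)
            score[i][j] = max(diag, score[i - 1][j] + gap, score[i][j - 1] + gap)
    return score[-1][-1]

if __name__ == "__main__":
    # A higher score indicates greater similarity between the two sequences.
    print(global_alignment_score("GATTACA", "GATCACA"))
```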
Goals
The goal of bioinformatics is thus to provide scientists with a means to explain:
1. Normal biological processes
2. Malfunctions in these processes which lead to diseases
3. Approaches to improving drug discovery
To study how normal cellular activities are altered in different disease states, the biological
data must be combined to form a comprehensive picture of these activities. Therefore, the field
of bioinformatics has evolved such that the most pressing task now involves the analysis and
interpretation of various types of data. This includes nucleotide and amino acid sequences,
protein domains, and protein structures. The actual process of analyzing and interpreting data
is referred to as computational biology.
Important sub-disciplines within bioinformatics and computational biology include:
• Development and implementation of computer programs that enable efficient access to,
use and management of, various types of information
• Development of new algorithms (mathematical formulas) and statistical measures that
assess relationships among members of large data sets. For example, there are methods to
locate a gene within a sequence, to predict protein structure and/or function, and to
cluster protein sequences into families of related sequences.
The primary goal of bioinformatics is to increase the understanding of biological processes.
What sets it apart from other approaches, however, is its focus on developing and applying
computationally intensive techniques to achieve this goal. Examples include: pattern
recognition, data mining, machine learning algorithms, and visualization. Major research
efforts in the field include sequence alignment, gene finding, genome assembly, drug design,
drug discovery, protein structure alignment, protein structure prediction, prediction of gene
expression and protein–protein interactions, genome-wide association studies, the modeling of
evolution and cell division/mitosis.
Bioinformatics now entails the creation and advancement of databases, algorithms,
computational and statistical techniques, and theory to solve formal and practical problems
arising from the management and analysis of biological data.
Tools: Used in three areas
• Molecular Sequence Analysis
• Molecular Structural Analysis
• Molecular Functional Analysis
Over the past few decades, rapid developments in genomic and other molecular research
technologies and developments in information technologies have combined to produce a
tremendous amount of information related to molecular biology.
Bioinformatics is the name
given to these mathematical and computing approaches used to glean understanding of
biological processes.
Common activities in bioinformatics include mapping and analyzing DNA and protein
sequences, aligning DNA and protein sequences to compare them, and creating and viewing 3-
D models of protein structures.
Bioinformatics encompasses the use of tools and techniques from three separate disciplines:
molecular biology (the source of the data to be analyzed), computer science (which supplies the
hardware for running analyses and the networks for communicating the results), and the
data-analysis algorithms which strictly define bioinformatics. For this reason, the editors have
decided to incorporate events from these areas into a brief history of the field.
A SHORT HISTORY OF BIOINFORMATICS
1933 A new technique, electrophoresis, is introduced by Tiselius for separating proteins
in solution.
1951 Pauling and Corey propose the structures for the alpha-helix and beta-sheet (Proc.
Natl. Acad. Sci. USA, 37: 205-211, 1951; Proc. Natl. Acad. Sci. USA, 37: 729-740,
1951).
1953 Watson and Crick propose the double helix model for DNA based on x-ray data
obtained by Franklin and Wilkins (Nature, 171: 737-738, 1953).
1954 Perutz's group develop heavy atom methods to solve the phase problem in protein
crystallography.
1955 The sequence of the first protein to be analyzed, bovine insulin, is announced by
F. Sanger.
1969 The ARPANET is created by linking computers at Stanford and UCLA.
1970 The details of the Needleman-Wunsch algorithm for sequence comparison are
published.
1972 The first recombinant DNA molecule is created by Paul Berg and his group.
1973 The Brookhaven Protein Data Bank is announced (Acta. Cryst. B, 1973, 29:
1746).
Robert Metcalfe receives his Ph.D. from Harvard University. His thesis describes
Ethernet.
1974 Vint Cerf and Robert Kahn develop the concept of connecting networks of
computers into an "internet" and develop the Transmission Control Protocol (TCP).
1975 Microsoft Corporation is founded by Bill Gates and Paul Allen.
Two-dimensional electrophoresis, where separation of proteins on SDS polyacrylamide
gel is combined with separation according to isoelectric points, is announced by P. H.
O'Farrell (J. Biol. Chem., 250: 4007-4021, 1975).
E. M. Southern published the experimental details for the Southern Blot technique of
specific sequences of DNA (J. Mol. Biol., 98: 503-517, 1975).
1977 The full description of the Brookhaven PDB (http://www.pdb.bnl.gov) is
published (Bernstein, F.C.; Koetzle, T.F.; Williams, G.J.B.; Meyer, E.F.; Brice, M.D.;
Rodgers, J.R.; Kennard, O.; Shimanouchi, T.; Tasumi, M.J.; J. Mol. Biol., 1977, 112:,
535).
Allan Maxam and Walter Gilbert (Harvard) and Frederick Sanger (U.K. Medical
Research Council), report methods for sequencing DNA.
1980 The first complete genome sequence for an organism (the bacteriophage ΦX174) is
published. The genome consists of 5,386 base pairs coding for nine proteins.
Wüthrich and co-workers publish a paper detailing the use of multi-dimensional NMR for
protein structure determination (Kumar, A.; Ernst, R.R.; Wüthrich, K.; Biochem. Biophys.
Res. Comm., 1980, 95: 1).
IntelliGenetics, Inc. founded in California. Their primary product is the IntelliGenetics
Suite of programs for DNA and protein sequence analysis.
1981 The Smith-Waterman algorithm for sequence alignment is published.
IBM introduces its Personal Computer to the market.
1982 The Genetics Computer Group (GCG) is created as part of the University of Wisconsin
Biotechnology Center. The company's primary product is the Wisconsin Suite of molecular
biology tools.
1983 The Compact Disk (CD) is launched.
1984 Jon Postel's Domain Name System (DNS) is placed on-line.
The Macintosh is announced by Apple Computer.
1985 The FASTP algorithm is published.
The polymerase chain reaction (PCR) is described by Kary Mullis and co-workers.
1986 The term "genomics" appears for the first time to describe the scientific
discipline of mapping, sequencing, and analyzing genes. The term is coined by
Thomas Roderick as the name for a new journal.
Amoco Technology Corporation acquires IntelliGenetics.
NSFnet debuts.
The SWISS-PROT database is created by the Department of Medical Biochemistry of
the University of Geneva and the European Molecular Biology Laboratory (EMBL).
1987 The use of yeast artificial chromosomes (YACs) is described (David T. Burke, et
al., Science, 236: 806-812).
The physical map of E. coli is published (Y. Kohara, et al., Cell, 51: 319-337).
1988 The National Center for Biotechnology Information (NCBI) is established at the
National Library of Medicine.
The Human Genome Initiative is started (Commission on Life Sciences, National
Research Council. Mapping and Sequencing the Human Genome, National Academy
Press: Washington, D.C.), 1988.
The FASTA algorithm for sequence comparison is published by Pearson and Lipman.
The Morris worm, an Internet computer program written by a student, infects an estimated
6,000 computers in the US.
1989 The Genetics Computer Group (GCG) becomes a private company.
Oxford Molecular Group, Ltd. (OMG) founded in Oxford, UK by Anthony
Marchington, David Ricketts, James Hiddleston, Anthony Rees, and W. Graham
Richards. Primary products: Anaconda, Asp, Cameleon and others (molecular
modeling, drug design, protein design).
1990 The BLAST program (Altschul et al.) is implemented.
Molecular Applications Group is founded in California by Michael Levitt and Chris
Lee. Their primary products are Look and SegMod which are used for molecular
modeling and protein design.
InforMax is founded in Bethesda, MD. The company's products address sequence
analysis, database and data management, searching, publication graphics, clone
construction, mapping and primer design.
1991 The research institute in Geneva (CERN) announces the creation of the
protocols which make up the World Wide Web.
The creation and use of expressed sequence tags (ESTs) is described (J. Craig Venter,
et al., Science, 252: 1651-1656).
Incyte Pharmaceuticals, a genomics company headquartered in Palo Alto, California,
is formed.
Myriad Genetics, Inc. is founded in Utah. The company's goal is to lead in the
discovery of major common human disease genes and their related pathways. The
Company has discovered and sequenced, with its academic collaborators, the following major
genes: BRCA1, BRCA2, CHD1, MMAC1, MMSC1, MMSC2, CtIP, p16, p19, and MTS2.
1992 Human Genome Sciences, Gaithersburg, Maryland, is formed by William
Haseltine.
The Institute for Genomic Research (TIGR) is established by Craig Venter.
Genome Therapeutics announces its incorporation.
Mel Simon and coworkers announce the use of BACs for cloning.
1993 CuraGen Corporation is formed in New Haven, CT.
Affymetrix begins independent operations in Santa Clara, California
1994 Netscape Communications Corporation is founded and releases Navigator, the
commercial version of NCSA's Mosaic.
Gene Logic is formed in Maryland.
The PRINTS database of protein motifs is published by Attwood and Beck.
Oxford Molecular Group acquires IntelliGenetics.
1995 The Haemophilus influenzae genome (1.8 Mb) is sequenced.
The Mycoplasma genitalium genome is sequenced.
1996 Oxford Molecular Group acquires the MacVector product from Eastman
Kodak.
The genome for Saccharomyces cerevisiae (baker's yeast, 12.1 Mb) is sequenced.
The Prosite database is reported by Bairoch et al.
Affymetrix produces the first commercial DNA chips.
1997 The genome for E. coli (4.7 Mbp) is published.
Oxford Molecular Group acquires the Genetics Computer Group.
LION bioscience AG founded as an integrated genomics company with strong focus on
bioinformatics. The company is built from IP out of the European Molecular Biology
Laboratory (EMBL), the European Bioinformatics Institute (EBI), the German Cancer
Research Center (DKFZ), and the University of Heidelberg.
Paradigm Genetics Inc., a company focused on the application of genomic
technologies to enhance worldwide food and fiber production, is founded in Research
Triangle Park, NC.
deCode genetics publishes a paper describing the location of the FET1 gene, which
is responsible for familial essential tremor, on chromosome 13 (Nature Genetics).
1998 The genomes for Caenorhabditis elegans and baker's yeast are published.
The Swiss Institute of Bioinformatics is established as a non-profit foundation.
Craig Venter forms Celera in Rockville, Maryland.
PE Informatics was formed as a Center of Excellence within PE Biosystems. This
center brings together and leverages the complementary expertise of PE Nelson and
Molecular Informatics, to further complement the genetic instrumentation expertise of
Applied Biosystems.
Inpharmatica, a new Genomics and Bioinformatics company, is established by
University College London, the Wolfson Institute for Biomedical Research, five
leading scientists from major British academic centers and Unibio Limited.
GeneFormatics, a company dedicated to the analysis and prediction of protein structure
and function, is formed in San Diego.
Molecular Simulations Inc. is acquired by Pharmacopeia.
1999 deCode genetics maps the gene linked to pre-eclampsia as a locus on
chromosome 2p13.
2000 The genome for Pseudomonas aeruginosa (6.3 Mbp) is published.
The A. thaliana genome (100 Mb) is sequenced.
The D. melanogaster genome (180 Mb) is sequenced.
Pharmacopeia acquires Oxford Molecular Group.
2001 The human genome (3,000 Mbp) is published.
2002 The Chang Gung Genomic Research Center is established, comprising a
Bioinformatics Center, a Proteomics Center and a Microarray Center.
Figure 1: Key milestones in the history of bioinformatics (timeline, 1950-2020).
Applications
Bioinformatics joins mathematics, statistics, computer science and information technology
to solve complex biological problems, usually at the molecular level, that cannot be solved
by other means. This field of science has many applications and research areas where it
can be applied.
All bioinformatics applications operate at the user level: biologists, including students at
various levels, can use these applications and apply the output in their research or study.
The various bioinformatics applications can be categorized under the following groups:
Sequence Analysis
Function Analysis
Structure Analysis
Figure 2: Categories of bioinformatics applications.
Sequence Analysis: All applications that analyze various types of sequence information,
and that can compare similar types of information, are grouped under sequence analysis.
Function Analysis: These applications analyze the function encoded within sequences and
help predict functional interactions between various proteins or genes. Expression analysis
of various genes is also a prime research topic these days.
Structure Analysis: In the realm of RNA and proteins, structure plays a vital role in the
interaction with other molecules. This gave birth to a whole new branch, termed
structural bioinformatics, which is devoted to predicting the structures of proteins and
RNA and the possible roles of these structures.
Sequence Analysis:
Sequence analysis uses sequencing information to determine which genes encode regulatory
sequences or peptides. Many powerful tools and computers perform the task of analyzing the
genomes of various organisms; they also detect DNA mutations in an organism and identify
related sequences. Shotgun sequencing techniques are used to obtain the sequences of
numerous fragments of DNA, and special software is then used to detect overlapping
fragments and assemble them.
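As an illustration of the overlap step that shotgun assembly software performs, the hedged sketch below finds the longest exact suffix-prefix overlap between two DNA fragments and merges them. Real assemblers use indexing structures and error-tolerant matching; the fragments and the minimum overlap length used here are purely illustrative.

```python
# Toy illustration of fragment overlap detection in shotgun sequence assembly.
# Exact matching only; real assemblers tolerate sequencing errors and use indexes.

def longest_overlap(left, right, min_len=3):
    """Length of the longest suffix of `left` that equals a prefix of `right`."""
    for length in range(min(len(left), len(right)), min_len - 1, -1):
        if left[-length:] == right[:length]:
            return length
    return 0

def merge_fragments(left, right, min_len=3):
    """Merge two fragments if they overlap sufficiently, otherwise return None."""
    overlap = longest_overlap(left, right, min_len)
    return left + right[overlap:] if overlap else None

if __name__ == "__main__":
    a, b = "ATGGCGTACGT", "TACGTTAGGCA"     # illustrative fragments
    print(longest_overlap(a, b))            # 5 ("TACGT")
    print(merge_fragments(a, b))            # ATGGCGTACGTTAGGCA
```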
Prediction of Protein Structure:
The primary structure of a protein, the amino acid sequence encoded by the DNA, is easy to
determine, but it is difficult to determine the secondary, tertiary or quaternary structures. For
this purpose either the method of crystallography or the tools of bioinformatics can be used
to determine complex protein structures.
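Before any prediction is attempted, a common first step is simply to read an experimentally solved structure from the Protein Data Bank and inspect it. The sketch below does this with Biopython's Bio.PDB module, assuming Biopython is installed and that a PDB file named example.pdb has been downloaded locally; both the identifier and the file name are placeholders.

```python
# Hedged sketch: parse a locally downloaded PDB file with Biopython and count
# the residues in each chain. "example.pdb" is a placeholder file name.
from Bio.PDB import PDBParser

parser = PDBParser(QUIET=True)                      # suppress parser warnings
structure = parser.get_structure("example", "example.pdb")

for model in structure:                             # a PDB entry may contain several models
    for chain in model:
        n_residues = sum(1 for _ in chain.get_residues())
        print(f"Model {model.id}, chain {chain.id}: {n_residues} residues")
```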
Genome Annotation:
In genome annotation, genomes are marked up to identify protein-coding regions and
regulatory sequences. It is a very important part of the human genome project, as it
identifies the regulatory sequences.
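One very small piece of genome annotation is locating candidate protein-coding regions. The toy sketch below scans the forward strand of a DNA string for open reading frames (ORFs) that start with ATG and end at a stop codon; real annotation pipelines combine statistical gene models, homology evidence and much more, and the example sequence is invented.

```python
# Toy forward-strand ORF scan, a tiny piece of what genome annotation involves.
START, STOPS = "ATG", {"TAA", "TAG", "TGA"}

def find_orfs(dna, min_codons=3):
    """Yield (start, end) coordinates of simple forward-strand ORFs."""
    dna = dna.upper()
    for frame in range(3):                       # check all three reading frames
        pos = frame
        while pos + 3 <= len(dna):
            if dna[pos:pos + 3] == START:
                end = pos + 3
                while end + 3 <= len(dna) and dna[end:end + 3] not in STOPS:
                    end += 3
                if end + 3 <= len(dna) and (end - pos) // 3 >= min_codons:
                    yield pos, end + 3           # include the stop codon
                pos = end                        # skip past this ORF
            pos += 3

if __name__ == "__main__":
    print(list(find_orfs("CCATGGCTGCTTAAGGGATGAAACCCTGA")))  # invented sequence
```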
Comparative Genomics:
Comparative genomics is the branch of bioinformatics that determines the relationship between
genomic structure and function across different biological species. For this purpose,
intergenomic maps are constructed which enable scientists to trace the evolutionary processes
that occur in the genomes of different species. These maps contain information about point
mutations as well as about duplications of large chromosomal segments.
Health and Drug Discovery:
The tools of bioinformatics are also helpful in drug discovery, diagnosis and disease
management. Complete sequencing of the human genome has enabled scientists to make
medicines and drugs which can target more than 500 genes. Different computational tools and
drug targets have made drug delivery easier and more specific, because now only those cells
which are diseased or mutated can be targeted. It is also easier to determine the molecular
basis of a disease.
Application of Bioinformatics in various Fields
Molecular medicine
The human genome will have profound effects on the fields of biomedical research and clinical
medicine. Every disease has a genetic component. This may be inherited (as is the case with an
estimated 3000-4000 hereditary diseases, including cystic fibrosis and Huntington's disease) or
a result of the body's response to an environmental stress which causes alterations in the
genome (e.g. cancers, heart disease, diabetes). The completion of the human genome sequence
means that we can search for the genes directly associated with different diseases and begin to
understand their molecular basis more clearly. This new knowledge of the molecular
mechanisms of disease will enable better treatments, cures and even preventative tests to be
developed.
Personalised medicine
Clinical medicine will become more personalised with the development of the field of
pharmacogenomics. This is the study of how an individual's genetic inheritance affects the
body's response to drugs. At present, some drugs fail to make it to the market because a small
percentage of the clinical patient population shows adverse effects to a drug due to sequence
variants in their DNA. As a result, potentially life-saving drugs never make it to the
marketplace. Today, doctors have to use trial and error to find the best drug to treat a particular
patient as those with the same clinical symptoms can show a wide range of responses to the
same treatment. In the future, doctors will be able to analyse a patient's genetic profile and
prescribe the best available drug therapy and dosage from the beginning.
Preventative medicine
With the specific details of the genetic mechanisms of diseases being unravelled, the
development of diagnostic tests to measure a person's susceptibility to different diseases may
become a distinct reality. Preventative actions, such as changes of lifestyle or having treatment
at the earliest possible stage when it is more likely to be successful, could result in huge
advances in our struggle to conquer disease.
Gene therapy
In the not too distant future, the potential for using genes themselves to treat disease may
become a reality. Gene therapy is the approach used to treat, cure or even prevent disease by
changing the expression of a person's genes. Currently, this field is in its infancy, with
clinical trials for many different types of cancer and other diseases ongoing.
Drug development
At present all drugs on the market target only about 500 proteins. With an improved
understanding of disease mechanisms and using computational tools to identify and validate
new drug targets, more specific medicines that act on the cause, not merely the symptoms, of
the disease can be developed. These highly specific drugs promise to have fewer side effects
than many of today's medicines.
Microbial genome applications
Microorganisms are ubiquitous, that is they are found everywhere. They have been found
surviving and thriving in extremes of heat, cold, radiation, salt, acidity and pressure. They are
present in the environment, our bodies, the air, food and water. Traditionally, use has been
made of a variety of microbial properties in the baking, brewing and food industries. The arrival
of the complete genome sequences and their potential to provide a greater insight into the
microbial world and its capacities could have broad and far reaching implications for
environment, health, energy and industrial applications. For these reasons, in 1994, the US
Department of Energy (DOE) initiated the MGP (Microbial Genome Project) to sequence
genomes of bacteria useful in energy production, environmental cleanup, industrial processing
and toxic waste reduction. By studying the genetic material of these organisms, scientists can
begin to understand these microbes at a very fundamental level and isolate the genes that give
them their unique abilities to survive under extreme conditions.
Waste cleanup
Deinococcus radiodurans is known as the world's toughest bacterium, and it is the most
radiation-resistant organism known. Scientists are interested in this organism because of its
potential usefulness in cleaning up waste sites that contain radiation and toxic chemicals.
Climate change Studies
Increasing levels of carbon dioxide emission, mainly through the expanding use of fossil fuels
for energy, are thought to contribute to global climate change. Recently, the DOE (Department
of Energy, USA) launched a program to decrease atmospheric carbon dioxide levels. One
method of doing so is to study the genomes of microbes that use carbon dioxide as their sole
carbon source.
Alternative energy sources
Scientists are studying the genome of the microbe Chlorobium tepidum, which has an unusual
capacity for generating energy from light.
Biotechnology
The archaeon Archaeoglobus fulgidus and the bacterium Thermotoga maritima have potential
for practical applications in industry and government-funded environmental remediation.
These microorganisms thrive in water temperatures above the boiling point and therefore may
provide the DOE, the Department of Defence, and private companies with heat-stable enzymes
suitable for use in industrial processes. Other industrially useful microbes include
Corynebacterium glutamicum which is of high industrial interest as a research object because
it is used by the chemical industry for the biotechnological production of the amino acid lysine.
The substance is employed as a source of protein in animal nutrition. Lysine is one of the
essential amino acids in animal nutrition. Biotechnologically produced lysine is added to feed
concentrates as a source of protein, and is an alternative to soybeans or meat and bonemeal.
Xanthomonas campestris pv. is grown commercially to produce the exopolysaccharide xanthan
gum, which is used as a viscosifying and stabilising agent in many industries. Lactococcus
lactis is one of the most important micro-organisms involved in the dairy industry. It is a non-pathogenic bacterium that is critical for manufacturing dairy products like
buttermilk, yogurt and cheese. This bacterium, Lactococcus lactis ssp., is also used to prepare
pickled vegetables, beer, wine, some breads and sausages and other fermented foods.
Researchers anticipate that understanding the physiology and genetic make-up of this
bacterium will prove invaluable for food manufacturers as well as the pharmaceutical industry,
which is exploring the capacity of L. lactis to serve as a vehicle for delivering drugs.
Antibiotic resistance
Scientists have been examining the genome of Enterococcus faecalis, a leading cause of
bacterial infection among hospital patients. They have discovered a virulence region made up
of a number of antibiotic-resistance genes that may contribute to the bacterium's transformation
from a harmless gut bacterium to a menacing invader. The discovery of the region, known as a
pathogenicity island, could provide useful markers for detecting pathogenic strains and help to
establish controls to prevent the spread of infection in hospital wards.
Forensic analysis of microbes
Scientists used genomic tools to help distinguish the strain of Bacillus anthracis used in the
2001 terrorist attacks in Florida from closely related anthrax strains.
The reality of bioweapon creation
Scientists have recently built the poliovirus using entirely artificial means. They did
this using genomic data available on the Internet and materials from a mail-order chemical
supply. The research was financed by the US Department of Defence as part of a biowarfare
response program to prove to the world the reality of bioweapons. The researchers also hope
their work will discourage officials from ever relaxing programs of immunisation. This project
has been met with very mixed feelings.
Evolutionary studies
The sequencing of genomes from all three domains of life, eukaryota, bacteria and archaea
means that evolutionary studies can be performed in a quest to determine the tree of life and
the last universal common ancestor.
Crop improvement
Comparative genetics of the plant genomes has shown that the organisation of their genes has
remained more conserved over evolutionary time than was previously believed. These findings
suggest that information obtained from model crop systems can be used to suggest
improvements to other food crops. At present the complete genomes of Arabidopsis thaliana
(thale cress) and Oryza sativa (rice) are available.
Insect resistance
Genes from Bacillus thuringiensis that can control a number of serious pests have been
successfully transferred to cotton, maize and potatoes. This new ability of the plants to resist
insect attack means that the amount of insecticides being used can be reduced and hence the
nutritional quality of the crops is increased.
Improve nutritional quality
Scientists have recently succeeded in transferring genes into rice to increase levels of Vitamin
A, iron and other micronutrients. This work could have a profound impact in reducing
occurrences of blindness and anaemia caused by deficiencies in Vitamin A and iron
respectively. Scientists have inserted a gene from yeast into the tomato, and the result is a plant
whose fruit stays longer on the vine and has an extended shelf life.
Development of drought-resistant varieties
Progress has been made in developing cereal varieties that have a greater tolerance for soil
alkalinity, free aluminium and iron toxicities. These varieties will allow agriculture to succeed
in poorer soil areas, thus adding more land to the global production base. Research is also in
progress to produce crop varieties capable of tolerating reduced water conditions.
Veterinary Science
Sequencing projects of many farm animals including cows, pigs and sheep are now well under
way in the hope that a better understanding of the biology of these organisms will have huge
impacts for improving the production and health of livestock and ultimately have benefits for
human nutrition.
Comparative Studies
Analysing and comparing the genetic material of different species is an important method for
studying the functions of genes, the mechanisms of inherited diseases and species evolution.
Bioinformatics tools can be used to make comparisons between the numbers, locations and
biochemical functions of genes in different organisms.
Organisms that are suitable for use in experimental research are termed model organisms. They
have a number of properties that make them ideal for research purposes, including short life
spans, rapid reproduction, ease of handling, low cost, and the ability to be manipulated at
the genetic level.
An example of a human model organism is the mouse. Mouse and human are very closely
related (>98%) and for the most part we see a one to one correspondence between genes in the
two species. Manipulation of the mouse at the molecular level, together with genome
comparisons between the two species, is revealing detailed information on the functions of
human genes, the evolutionary relationship between the two species and the molecular
mechanisms of many human diseases.
Table 1
Definitions of Fields Related to Bioinformatics
Bioinformatics has various applications in research in medicine, biotechnology, agriculture,
etc. The following research fields have bioinformatics as an integral component:
1. Computational Biology: The development and application of data-analytical and
theoretical methods, mathematical modeling and computational simulation techniques
to the study of biological, behavioral, and social systems.
2. Genomics: Genomics is any attempt to analyze or compare the entire genetic
complement of a species or of several species. It is, of course, possible to compare
genomes by comparing more-or-less representative subsets of genes within genomes.
3. Proteomics: Proteomics is the study of proteins - their location, structure and
function. It is the identification, characterization and quantification of all proteins
involved in a particular pathway, organelle, cell, tissue, organ or organism that can be
studied in concert to provide accurate and comprehensive data about that system.
Proteomics is the study of the function of all expressed proteins. The study of the
proteome, called proteomics, now evokes not only all the proteins in any given cell,
but also the set of all protein isoforms and modifications, the interactions between
them, the structural description of proteins and their higher-order complexes, and,
for that matter, almost everything 'post-genomic'.
4. Pharmacogenomics: Pharmacogenomics is the application of genomic approaches
and technologies to the identification of drug targets. In short, pharmacogenomics is
using genetic information to predict whether a drug will help make a patient well or
sick. It studies how genes influence the response of humans to drugs, from the
population to the molecular level.
5. Pharmacogenetics: Pharmacogenetics is the study of how the actions of and reactions
to drugs vary with the patient's genes. All individuals respond differently to drug
treatments; some positively, others with little obvious change in their conditions and
yet others with side effects or allergic reactions. Much of this variation is known to
have a genetic basis. Pharmacogenetics is a subset of pharmacogenomics which uses
genomic/bioinformatic methods to identify genomic correlates, for example SNPs
(Single Nucleotide Polymorphisms), characteristic of particular patient response
profiles and use those markers to inform the administration and development of
therapies. Strikingly, such approaches have been used to "resurrect" drugs previously
thought to be ineffective but subsequently found to work in a subset of patients, or to
optimize the doses of chemotherapy for particular patients.
6. Cheminformatics:
Chemical informatics: 'Computer-assisted storage, retrieval and analysis of chemical
information, from data to chemical knowledge.' This definition is distinct from
chemoinformatics, which focuses on drug design.
Chemometrics: The application of statistics to the analysis of chemical data (from
organic, analytical or medicinal chemistry) and the design of chemical experiments and
simulations.
Computational chemistry: A discipline using mathematical methods for the calculation of
molecular properties or for the simulation of molecular behavior. It also includes, e.g.,
synthesis planning, database searching and combinatorial library manipulation.
7. Structural genomics or structural bioinformatics refers to the analysis of
macromolecular structure, particularly proteins, using computational tools and
theoretical frameworks. One of the goals of structural genomics is the extension of
the idea of genomics to obtain accurate three-dimensional structural models for all
known protein families, protein domains or protein folds. Structural alignment is a
tool of structural genomics.
8. Comparative genomics: The study of human genetics by comparisons with model
organisms such as mice, the fruit fly, and the bacterium E. coli.
9. Biophysics: The British Biophysical Society defines biophysics as: "an
interdisciplinary field which applies techniques from the physical sciences to
understanding biological structure and function".
10. Biomedical informatics / Medical informatics: "Biomedical Informatics is an
emerging discipline that has been defined as the study, invention, and implementation
of structures and algorithms to improve communication, understanding and
management of medical information."
11. Mathematical Biology: Mathematical biology also tackles biological problems, but
the methods it uses to tackle them need not be numerical and need not be implemented
in software or hardware. It includes things of theoretical interest which are not
necessarily algorithmic, not necessarily molecular in nature, and are not necessarily
useful in analyzing collected data.
12. Computational chemistry: Computational chemistry is the branch of theoretical
chemistry whose major goals are to create efficient computer programs that calculate
the properties of molecules (such as total energy, dipole moment, vibrational
frequencies) and to apply these programs to concrete chemical objects. It is also
sometimes used to cover the areas of overlap between computer science and
chemistry.
13. Functional genomics: Functional genomics is a field of molecular biology that is
attempting to make use of the vast wealth of data produced by genome sequencing
projects to describe genome function. Functional genomics uses high-throughput
techniques like DNA microarrays, proteomics, metabolomics and mutation analysis
to describe the function and interactions of genes.
14. Pharmacoinformatics: Pharmacoinformatics concentrates on the aspects of
bioinformatics dealing with drug discovery.
15. In silico ADME-Tox Prediction: Drug discovery is a complex and risky treasure
hunt to find the most efficacious molecule which does not have toxic effects but at the
same time has the desired pharmacokinetic profile. The hunt starts when researchers
look for the binding affinity of a molecule to its target. A huge amount of research is
required to come up with a molecule which has a reliable binding profile.
Once such molecules have been identified, the traditional methodology is to subject the
molecule to further optimization with the aim of improving efficacy.
The molecules which show better binding are then evaluated for their toxicity and
pharmacokinetic profiles. It is at this stage that most of the candidates fail in the race to
become a successful drug.
16. Agroinformatics / Agricultural informatics: Agroinformatics concentrates on the
aspects of bioinformatics dealing with plant genomes.
INTERNET
The Internet is a global system of interconnected computer networks that use the standard
Internet protocol suite (TCP/IP) to serve billions of users worldwide. It is a network of
networks that consists of millions of private, public, academic, business, and government
networks, of local to global scope, that are linked by a broad array of electronic, wireless and
optical networking technologies. The Internet carries a vast range of information resources and
services, such as the inter-linked hypertext documents of the World Wide Web (WWW) and
the infrastructure to support electronic mail.
Uses of Internet
The Internet has been the most useful technology of modern times, helping us not only in
our daily lives but also in our personal and professional development.
For students and educational purposes, the Internet is widely used to gather information
for research or to add to knowledge of various subjects. Business professionals, and
professionals such as doctors, also access the Internet to filter out the information they
need. The Internet is therefore the largest encyclopedia for everyone, in all age categories.
It is also very useful for maintaining contact with friends and relatives who live abroad.
Advantages of Internet:
E-mail: E-mail is now an essential communication tool in business. With e-mail you can
send and receive instant electronic messages, which works like writing letters. Your messages
are delivered instantly to people anywhere in the world, unlike traditional mail that takes
a lot of time. E-mail is free, fast and very cheap when compared to telephone, fax and postal
services.
24 hours a day, 7 days a week: The Internet is available for use around the clock.
Information: Information is probably the biggest advantage internet is offering. There is a
huge amount of information available on the internet for just about every subject, ranging
from government law and services, trade fairs and conferences, market information, new
ideas and technical support. You can find almost any type of data on almost any subject
you are looking for by using search engines like Google, Yahoo, MSN, etc.
Online Chat: You can access many ‘chat rooms’ on the web that can be used to meet new
people, make new friends, as well as to stay in touch with old friends. You can chat on
MSN and Yahoo websites.
Services: Many services are provided on the internet like net banking, job searching,
purchasing tickets, hotel reservations, and guidance services on an array of topics
covering every aspect of life.
Communities: Communities of all types have sprung up on the Internet. It's a great way
to meet people with similar interests and discuss common issues.
E-commerce: Along with getting information on the Internet, you can also shop
online. There are many online stores and sites that can be used to look for products as
well as buy them using your credit card. You do not need to leave your house and can do
all your shopping from the convenience of your home. There is an amazingly wide range
of products, from household needs and electronics to entertainment.
Entertainment: The Internet provides access to a wide range of audio/video content, songs,
plays and films, many of which can be downloaded. One such popular website is YouTube.
Software Downloads: You can freely download innumerable software programs like
utilities, games, music, videos, movies, etc. from the Internet.
Limitations of Internet
Theft of personal information: Electronic messages sent over the Internet can be easily
snooped and tracked, revealing who is talking to whom and what they are talking about.
If you use the Internet, your personal information such as your name, address, credit card,
bank details and other information can be accessed by unauthorized persons. If you use
a credit card or internet banking for online shopping, then your details can also be ‘stolen’.
Negative effects on family communication: It is generally observed that due to more time
spent on Internet, there is a decrease in communication and feeling of togetherness among the
family members.
Internet addiction: There is some controversy over whether it is possible to actually be
addicted to the Internet or not. Some researchers claim that it is simply people trying to
escape their problems in an online world.
Children's use of the Internet has become a big concern. Most parents do not realize the
dangers involved when their children log onto the Internet. When children talk to others
online, they do not realize they could actually be talking to a harmful person. Moreover,
pornography is also a very serious issue concerning the Internet, especially when it comes to
young children. There are thousands of pornographic sites on the Internet that can be easily
found and are a detriment to letting children use the Internet.
Virus threat: Today, not only are humans getting viruses, but computers are as well.
Computers mainly get these viruses from the Internet. A virus is a program which disrupts the
normal functioning of your computer system. Computers attached to the Internet are more
prone to virus attacks, which can end up crashing your whole hard disk.
Spamming: Spamming is the act of sending unsolicited e-mail. This multiple or
vast e-mailing is often compared to mass junk mailings. It needlessly obstructs the entire
system. Most spam is commercial advertising, often for dubious products, get-rich-quick
schemes, or quasi-legal services. Spam costs the sender very little to send; most of the
costs are paid by the recipient or the carriers rather than by the sender.
SERVICES OF INTERNET - E-mail, FTP, Telnet
Email, discussion groups, long-distance computing, and file transfers are some of the important
services provided by the Internet. Email is the fastest means of communication. With email one
can also send software and certain forms of compressed digital images as attachments. News
groups or discussion groups allow Internet users to take part in various kinds of debate,
discussion and news sharing. Long-distance computing was an original inspiration for the
development of the ARPANET and still provides a very useful service on the Internet.
Programmers can maintain
accounts on distant, powerful computers and execute programs. File transfer service allows
Internet users to access remote machines and retrieve programs, data or text.
E-Mail (Electronic Mail)
E-mail or Electronic mail is a paperless method of sending messages, notes or letters from
one person to another, or even to many people at the same time, via the Internet. E-mail is very
fast compared to normal post: messages usually take only a few seconds to arrive at
their destination. One can send messages at any time of the day or night and they will be
delivered immediately. You need not wait for the post office to open and you don't have
to worry about holidays. It works 24 hours a day and seven days a week. What's more,
the copy of the message you have sent will be available whenever you want to look at it even in
the middle of the night. You have the privilege of sending something extra such as a file,
graphics, images, etc. along with your e-mail. The biggest advantage of using e-mail is that
it is cheap, especially when sending messages to other states or countries, and at the same time
it can be delivered to a number of people around the world.
E-mail allows you to compose a note, get the address of the recipient and send it. Once the mail
is received and read, it can be forwarded, replied to, stored for later use, or deleted. The sender
can even request a delivery receipt and a read receipt from the recipient.
Features of E-mail:
⚫ One-to-one or one-to-many communications
⚫ Instant communications
⚫ Physical presence of recipient is not required
⚫ Most inexpensive mail services, 24-hours a day and seven days a week
⚫ Encourages informal communications
Components of an E-mail Address
As in the case of the normal mail system, e-mail is also based upon the concept of a recipient
address. The e-mail address provides all of the information required to get a message to the
recipient from anywhere in the world. Consider the e-mail ID:
john@hotmail.com
In this example, john is the username of the person who will be sending/receiving the
e-mail, hotmail is the mail server where the username john has been registered, and com is the
type of organization on the Internet which is hosting the mail server.
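The same decomposition can be sketched in a couple of lines of Python, splitting the address at the "@" separator and at the final dot; the address is the illustrative one used above.

```python
# Split an e-mail address into user name, mail server and organization type.
address = "john@hotmail.com"                       # illustrative address from the text

username, _, host = address.partition("@")         # part before / after the "@"
mail_server, _, org_type = host.rpartition(".")    # split the host at its last dot

print(username)      # john    -> user name registered on the mail server
print(mail_server)   # hotmail -> the mail server (domain) name
print(org_type)      # com     -> type of organization hosting the mail server
```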
FTP (File Transfer Protocol)
File Transfer Protocol (FTP) is an Internet utility used to upload and download files.
It gives access to directories or folders on remote computers and allows software, data and
text files to be transferred between different kinds of computers. FTP works on the same
client/server principle as other Internet services: an FTP "client" is a program running on your
computer that enables you to communicate with remote computers. The FTP client takes
FTP commands and sends them as requests for information to remote computers
known as FTP servers. To access a remote FTP server one normally has an account on
that server: when the FTP client connects, the FTP server asks for identification in the form
of the FTP client's login name and password. If one does not have an account on the remote
FTP server, one can still connect to the server using anonymous login.
Using anonymous login, anyone can log in to an FTP server and access public archives
anywhere in the world without having an account. One can easily log in to the FTP site with
the username "anonymous" and an e-mail address as the password.
Objectives of FTP:
⚫ Provide flexibility and promote sharing of computer programs, files and data
⚫ Transfer data reliably and more efficiently over network
⚫ Encourage implicit or indirect use of remote computers using Internet
⚫ Shield a user from variations in storage systems among hosts.
The basic steps in an FTP session
Start up your FTP client by typing ftp at your system's command line/'C>' prompt (or,
if you are in Windows, double-click on the FTP icon).
Give the FTP client an address to connect. This is the FTP server address to which the
FTP client will get connected
Identify yourself to the FTP remote site by giving the Login Name
Give the remote site a password
Remote site will verify the Login Name/ Password to allow the FTP client to access its
files
Look through the directory for files on the FTP server
Change Directories if required
Set the transfer mode (optional);
Get the file(s) you want, and
Quit.
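The steps above can be sketched with Python's standard ftplib module using anonymous login. The host name, directory and file name below are placeholders, so a real, reachable FTP server and an existing file path would have to be substituted before this would run successfully.

```python
# Hedged sketch of an anonymous FTP session using Python's standard ftplib.
# "ftp.example.org", the "pub" directory and "readme.txt" are placeholders.
from ftplib import FTP

ftp = FTP("ftp.example.org")            # connect to the FTP server address
ftp.login()                             # anonymous login (user "anonymous")
print(ftp.nlst())                       # list files in the current directory
ftp.cwd("pub")                          # change directory (assumed to exist)
with open("readme.txt", "wb") as fh:    # transfer a file in binary mode
    ftp.retrbinary("RETR readme.txt", fh.write)
ftp.quit()                              # end the session
```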
Figure 3: FTP client-server model (user interface, FTP client, FTP server system and file
system, and the connection between them).
Telnet (Remote Computing)
Telnet, or remote computing, is telecommunication utility software which uses
available telecommunication facilities and allows you to become a user on a remote
computer. Once you gain access to the remote computer, you can use it for the intended
purpose. TELNET works in a step-by-step procedure: the commands typed on the
client computer are sent to the local Internet Service Provider (ISP), and then from the ISP
to the remote computer to which you have gained access. Most ISPs provide the facility to
telnet into your own account from another city and check your e-mail while you are
travelling or away on business.
The following steps are required for a TELNET session
⚫ Start up the TELNET program
⚫ Give the TELNET program an address to connect (some really nifty TELNET
packages allow you to combine steps 1 and 2 into one simple step)
⚫ Make a note of what the “escape character” is
⚫ Log in to the remote computer,
⚫ Set the “terminal emulation”
⚫ Play around on the remote computer, and
⚫ Quit.
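To illustrate the underlying idea of a remote text session, the sketch below simply opens a TCP connection to a host and reads whatever greeting the server sends back. The host name is a placeholder, port 23 is the traditional Telnet port, and most modern servers no longer accept plain (unencrypted) Telnet, so this is for illustration only.

```python
# Illustrative sketch of the client side of a remote text session:
# open a TCP connection and read the server's greeting banner.
import socket

HOST, PORT = "example.com", 23          # placeholder host; 23 is the classic Telnet port

with socket.create_connection((HOST, PORT), timeout=10) as conn:
    banner = conn.recv(1024)            # read up to 1 KB of the greeting
    print(banner.decode(errors="replace"))
```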
TYPES OF INTERNET CONNECTIONS
There are five types of Internet connections, which are as follows:
(i) Dial up Connection
(ii) Leased Connection
(iii) DSL connection
(iv) Cable Modem Connection
(v) VSAT
Dial up connection
Dial-up refers to an Internet connection that is established using a modem. The modem
connects the computer to standard phone lines, which serve as the data transfer medium.
When a user initiates a dial-up connection, the modem dials a phone number of an Internet
Service Provider (ISP) that is designated to receive dial-up calls. The ISP then establishes the
connection, which usually takes about ten seconds and is accompanied by several beeping
and buzzing sounds. After the dial-up connection has been established, it is active until
the user disconnects from the ISP. Typically, this is done by selecting the “Disconnect”
option using the ISP’s software or a modem utility program. However, if a dial-up
connection is interrupted by an incoming phone call or someone picking up a phone in
the house, the service may also be disconnected.
Advantages
Low Price
Secure connection – your IP address continually changes
Offered in rural areas – you need a phone line
Disadvantages
Slow speed.
Phone line is required.
Busy signals for friends and family members.
Leased Connection
Leased connection is a permanent telephone connection between two points set up by a
telecommunications common carrier. Typically, leased lines are used by businesses to
connect geographically distant offices. Unlike normal dial-up connections, a leased line is
always active. The fee for the connection is a fixed monthly rate. The primary factors
affecting the monthly fee are distance between end points and the speed of the circuit.
Because the connection doesn’t carry anybody else’s communications, the carrier can assure
a given level of quality.
For example, a T-1 channel is a type of leased line that provides a maximum transmission
speed of 1.544 Mbps. You can divide the connection into different lines for data
and voice communication or use the channel for one high-speed data circuit. Dividing
the connection is called multiplexing.
Increasingly, leased lines are being used by companies, and even individuals, for Internet
access because they afford faster data transfer rates and are cost-effective if the Internet is
used heavily.
Advantages
• Secure and private: dedicated exclusively to the customer
• Speed: symmetrical and direct
• Reliable: minimum down time
• Wide choice of speeds: bandwidth on demand, easily upgradeable
• Leased lines are suitable for in-house office web hosting
Disadvantages
• Leased lines can be expensive to install and rent.
• Not suitable for single or home workers
• Lead times can be as long as 65 working days
• Distance dependent to nearest POP
• Leased lines have traditionally been the more expensive access option. A Service
Level Agreement (SLA) confirms an ISP’s contractual requirement in ensuring the
service is maintained. This is often lacking in cheaper alternatives.
DSL connection
Digital Subscriber Line (DSL) is a family of technologies that provides digital data
transmission over the wires of a local telephone network. DSL originally stood for digital
subscriber loop. In telecommunications marketing, the term DSL is widely understood to
mean Asymmetric Digital Subscriber Line (ADSL), the most commonly installed DSL
technology. DSL service is delivered simultaneously with wired telephone service on the
same telephone line. This is possible because DSL uses higher frequency bands for data
separated by filtering. On the customer premises, a DSL filter on each outlet removes the
high frequency interference, to enable simultaneous use of the telephone and data.
The data bit rate of consumer DSL services typically ranges from 256 kbit/s to 40 Mbit/s in
the direction to the customer (downstream), depending on DSL technology, line conditions,
and service-level implementation. In ADSL, the data throughput in the upstream direction
(the direction to the service provider) is lower, hence the designation asymmetric
service. In Symmetric Digital Subscriber Line (SDSL) services, the downstream and
upstream data rates are equal.
Advantages:
Security: Unlike cable modems, each subscriber can be configured so that it will not be
on the same network. In some cable modem networks, other computers on the cable
modem network are left visibly vulnerable and are easily susceptible to break in as well
as data destruction.
Integration: DSL will easily interface with ATM and WAN technology.
High bandwidth
Cheap line charges from the phone company.
Good for “bursty” traffic patterns
Disadvantages
No current standardization: A person moving from one area to another might find that
their DSL modem is just another paperweight. Customers may have to buy new
equipment to simply change ISPs.
Expensive: Most customers are not willing to spend more than $20 to $25 per month
for Internet access. Current installation costs, including the modem, can be as high as
$750. Prices should come down within 1-3 years. As with all computer technology, being
first usually means an emptier wallet.
Distance Dependence: The farther you live from the DSLAM (DSL Access
Multiplexer), the lower the data rate. The longest run lengths are 18,000 feet, or a little
over 3 miles.
Cable Modem Connection
A cable modem is a type of network bridge and modem that provides bi-directional data
communication via radio frequency channels on an HFC or RFoG infrastructure. Cable
modems are primarily used to deliver broadband Internet access in the form of cable
Internet, taking advantage of the high bandwidth of an HFC or RFoG network. They are
commonly deployed in Australia, Europe, Asia and the Americas.
Figure 4
Figure 4 shows the most common network connection topologies when using cable modems.
The cable TV company runs a coaxial cable into the building to deliver their Internet service.
Although fed from the same coax that provides cable TV service, most companies place a
splitter outside of the building and run two cables in, rather than using a splitter at the set-top
box. The coax terminates at the cable modem.
The cable modem itself attaches to the SOHO computing equipment via its 10BASE-T port.
In most circumstances, the cable modem attaches directly to a user’s computer. If a LAN is
present on the premises (something many cable companies frown upon), some sort of router
can be connected to the cable modem.
Advantages
Always Connected: A cable modem connection is always connected to the Internet. This
is advantageous because you do not have to wait for your computer to “log on” to the
Internet; however, this also has the disadvantage of making your computer more
vulnerable to hackers.
Broadband: Cable modems transmit and receive data as digital packets, meaning they
provide high-speed Internet access. This makes cable modem connections much faster than
traditional dial-up connections.
Bandwidth: Cable modems have the potential to receive data from their cable provider
at speeds greater than 30 megabits per second; unfortunately, this speed is rarely ever
realized. Cable lines are shared by all of the cable modem users in a given area; thus, the
connection speed varies depending upon the number of other people using the Internet
and the amount of data they are receiving or transmitting.
File Transfer Capabilities: Downloads may be faster, but uploads are typically slower.
Since the same lines are used to transmit data to and from the modem, priority is often
given to data traveling in one direction.
Signal Integrity: Cable Internet can be transmitted long distances with little signal
degradation. This means the quality of the Internet signal is not significantly decreased
by the distance of the modem from the cable provider.
Routing: Cable routers allow multiple computers to be hooked up to one cable modem, so
several devices can share a single connection. Wireless routers can also be attached to your
cable modem.
Rely on Existing Connections: Cable modems connect directly to preinstalled cable
lines. This is advantageous because you do not need to have other services, such as
telephone or Internet, in order to receive Internet through your cable modem. The
disadvantage is that you cannot have cable internet in areas where there are no cable lines.
Disadvantages
Cable internet technology excels at maintaining signal strength over distance. Once it
is delivered to a region, however, such as a neighborhood, it is split among that region's
subscribers. While increased capacity has diminished the effect somewhat, it is
still possible that users will see significantly lower speeds at peak times when more people
are using the shared connection.
Bandwidth equals money, so cable’s advantage in throughput comes with a price.
Even in plans of similar speed, customers typically pay more per megabit with
cable than they do with DSL.
It’s hard to imagine, but there are still pockets of the United States without adequate cable
television service. There are far fewer such pockets without residential land-line telephone
service, meaning cable internet is, on balance, less accessible in remote areas.
VSAT
VSAT is short for very small aperture terminal, an earthbound station used in satellite
communications of data, voice and video signals, excluding broadcast television. A
VSAT consists of two parts, a transceiver that is placed outdoors in direct line of sight to
the satellite and a device that is placed indoors to interface the transceiver with the end
user’s communications device, such as a PC. The transceiver receives or sends a signal to a
satellite transponder in the sky. The satellite sends and receives signals from a ground
station computer that acts as a hub for the system. Each end user is interconnected with
the hub station via the satellite, forming a star topology. The hub controls the entire
operation of the network. For one end user to communicate with another, each
transmission has to first go to the hub station that then retransmits it via the satellite to
the other end user’s VSAT.
Advantages
Satellite communication systems have some advantages that can be exploited for the provision
of connectivity. These are:
• Costs Insensitive to Distance
• Single Platform service delivery (one-stop-shop)
• Flexibility
• Upgradeable
• Low incremental costs per unit
Disadvantages
However like all systems there are disadvantages also. Some of these are
• High start-up costs (hubs and basic elements must be in place before the
services can be provided)
• Higher than normal risk profiles
• Severe regulatory restrictions imposed by countries that prevent VSAT
networks and solutions from reaching critical mass and therefore profitability
• Some service quality limitations, such as high signal delays (latency)
• Natural availability limits that cannot be mitigated against
• Lack of skills required in the developing world to design, install and maintain
satellite communication systems adequately
DOWNLOADING FILES
Downloading is the process of copying a file (such as a game or utility) from one
computer to another across the internet. When you download a game from our web site,
it means you are copying it from the author or publisher’s web server to your own
computer. This allows you to install and use the program on your own machine.
Here’s how to download a file using Internet Explorer and Windows XP. (This example
shows a download of the file “dweepsetup.exe” from Dexterity Games.) If you’re using a
different browser such as Netscape Navigator or a different version of Windows, your
screen may look a little different, but the same basic steps should work.
Click on the download link for the program you want to download. Many sites offer
multiple download links to the same program, and you only need to choose one of these
links.
You may be asked if you want to save the file or run it from its current location. If you
are asked this question, select "Save." If not, don't worry; some browsers will
automatically choose "Save" for you.
You will then be asked to select the folder where you want to save the program or file,
using a standard “Save As” dialog box. Pay attention to which folder you select before
clicking the "Save" button. It may help you to create a folder like "C:\Download" for all
of your downloads, but you can use any folder you’d like.
The download will now begin. Your web browser will keep you updated on the progress of
the download by showing a progress bar that fills up as you download. You will also be
reminded where you're saving the file. The file will be saved as "C:\Download\dweepsetup.exe"
in the picture below.
Note: You may also see a check box labeled "Close this dialog box when download
completes." If you see this check box, it helps to uncheck it. You don't have to, but doing
so will make it easier to find the file after you download it.
Depending on which file you’re downloading and how fast your connection is, it may
take anywhere from a few seconds to a few minutes to download. When your download
is finished, if you left the “Close this dialog box when download completes” option
unchecked, you’ll see a dialog box as shown in fig. :
Figure 5a
Figure 5b and Figure 5c
Now click the “Open” button to run the file you just downloaded. If you don’t see
the “Download complete” dialog box, open the folder where you saved the file and
double-click on the icon for the file there.
What happens next will depend on the type of file you downloaded. The files you’ll
download most often will end in one of two extensions. (An extension is the last few letters
of the filename, after the period.) They are:
.EXE files: The file you downloaded is a program. Follow the on-screen instructions
from there to install the program to your computer and to learn how to run the program
after it’s installed.
.ZIP files: ZIP is a common file format used to compress and combine files to make them
download more quickly. Some versions of Windows (XP and sometimes ME) can read
ZIP files without extra software. Otherwise, you will need an unzipping program to
read these ZIP files. Common unzipping programs are WinZip, PKZIP, and Bit Zipper,
but there are also many others. Many unzipping programs are shareware, which means
you will need to purchase them if you use them beyond their specified trial period.
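For readers who prefer to script the same steps, this download-and-unpack workflow can be
reproduced with Python's standard library. This is only a sketch: the URL and the C:\Download
folder are placeholders standing in for whatever link and folder you actually use, not the
Dexterity Games link described above.

# Download a file to a chosen folder and unpack it if it is a ZIP archive.
import os
import urllib.request
import zipfile

url = "https://example.com/downloads/somefile.zip"   # placeholder download link
download_dir = r"C:\Download"                        # any folder you like
os.makedirs(download_dir, exist_ok=True)

local_path = os.path.join(download_dir, url.rsplit("/", 1)[-1])
urllib.request.urlretrieve(url, local_path)          # copy the file to this computer

if zipfile.is_zipfile(local_path):                   # .ZIP files must be unpacked before use
    with zipfile.ZipFile(local_path) as archive:
        archive.extractall(download_dir)

print("Saved to", local_path)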
World Wide Web
What is the Internet? What is the World Wide Web? How are they related?
The Internet is an international network (in this case, a collection of connected computers)
built for the purpose of communicating information. The Internet offers many
software services for this purpose, including:
• World Wide Web
• E-mail
• Instant messaging, chat
• Telnet (a service that lets a user login to a remote computer that the user has login privileges
for)
• FTP (File Transfer Protocol) – a service that lets one use the Internet to copy files from one
computer to another
The Web was originally designed for the purpose of displaying “public domain” data to anyone
who could view it. Although this is probably the most popular use of the Web today, other
uses of the Web include:
• Research, using tools such as “search engines” to find desired information.
• A variety of databases are available on the Web (this is another “research” tool). One
example of such a database: a library’s holdings.
• Shopping – most sizable commercial organizations have Web sites with forms you can fill
out to specify goods or services you wish to purchase. Typically, you must include your
credit card information in this form. Your credit card information is usually safe: the
system is typically automated so no human can see (and steal) your credit card number.
• We can generalize the above: Web forms can be filled out and submitted to apply for
admission to a university, to give a donation to a charity, to apply for a job, to become a
member of an organization, do banking chores, pay bills, etc.
• Listen to music or radio-like broadcasts, view videos or tv-like broadcasts.
• Some use the Web to access their e-mail or bulletin board services such as Blackboard.
• Most “browsers” today are somewhat like operating systems, in that they can enable a
variety of application programs. For example, a Word, Excel, or PowerPoint document can
be placed on the Web and viewed in its “native” application.
Some terminology you should know:
• Browser: A program used to view Web documents. Popular browsers include Microsoft
Internet Explorer (IE), Netscape, Opera; an old text-only browser called Lynx is still around
on some systems; etc. The browsers of Internet Service Providers (ISPs) like AOL,
Adelphia, Juno, etc., are generally one of the above, with the ISP’s logo displayed. Most
browsers work alike today. There may be minor differences (for example, what IE calls
"Favorites," Netscape calls "Bookmarks").
• A Web document is called a “page.” A collection of related pages is a “site.” A Web
site typically has a “home page” designed to be the first, introductory, page a user of the
site views.
• A Web page typically has an "address" or URL (Uniform Resource Locator). You can
view a desired page by using any of several methods to inform your browser of the URL
whose page you wish to view. The home page of a site typically has a URL of the form
http://www.DomainName.suffix
where the “DomainName” typically tells you something about the identity of the “host” or
“owner” of the site, and the “suffix” typically tells either the type of organization of the
owner or its country. Some common suffixes include:
✓ edu – An educational institution, usually a college or university.
✓ com – A commercial site – a company
✓ gov – a government site
✓ org – an organization that’s non-profit
✓ net – an alternative to “com” for network service providers
Also, the Internet originally was almost entirely centered in the US. As it spread to other
countries, it became common for sites outside the US to use a suffix that’s a 2-letter country
abbreviation: “ca” (without quotation marks) for Canada; “it” for Italy; “mx” for Mexico;
etc.
A page that isn’t a home page will typically have an address that starts with its site’s home
page address, and has appended further text to describe the page. For example, the Niagara
University home page is at http://www.niagara.edu/ and the Niagara University Academics
page is at http://www.niagara.edu/academic.htm.
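The parts of a URL described above can also be pulled apart programmatically. A minimal
sketch using Python's standard urllib.parse module on the Niagara University address quoted
above:

# Split a URL into its scheme, domain name and path.
from urllib.parse import urlparse

parts = urlparse("http://www.niagara.edu/academic.htm")

print(parts.scheme)                       # 'http'            - the protocol
print(parts.netloc)                       # 'www.niagara.edu' - host (domain name + suffix)
print(parts.path)                         # '/academic.htm'   - the page within the site
print(parts.netloc.rsplit(".", 1)[-1])    # 'edu'             - the suffix discussed above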
Navigating:
• One way to reach a desired page is to enter its URL in the “Address” textbox.
• You can click on a link (usually underlined text, or a graphic may also serve as a link; notice
that the mouse cursor changes its symbol, typically to a hand, when hovering over a Web
link) to get to the page addressed by the link.
• The Back button may be used to retrace your steps, revisiting pages recently visited.
• You can click the Forward button to retrace your steps through pages recently Backed out
of.
• Notice the drop-down button at the right side of the Address textbox. This reveals a menu
of URLs recently visited by users of the browser on the current computer. You may click
one of these URLs to revisit its page.
• Favorites (what Netscape calls “Bookmarks”) are URLs saved for the purpose of making
revisits easy. If you click a Favorite, you can easily revisit the corresponding page.
How do we find information on the Web? Caution: Don’t believe everything you see on the
Web. Many Web sites have content made up of hate literature, political propaganda, unfounded
opinions, and other content of dubious reliability. Therefore, you should try to use good
judgment about the sites you use for research.
Strategies for finding information on the Web include:
• Often, you can make an intelligent guess at the URL of a desired site. For example, you
might guess the UB Web site is http://www.ub.edu (turned out to be the University of
Barcelona) or http://www.buffalo.edu (was correct); similarly, if you’re interested in the
IRS Web site, you might try http://www.irs.gov, and it works. For Enron, you might try
http://www.enron.com, which redirected us to the page http://www.enron.com/corp/.
• “Search engines” are Web services provided on a number of Web sites, allowing you to
enter a keyword or phrase describing the topic you want information for. You may then
click a button to activate the search. A list of links typically appears, and you may explore
these links to find (you hope) the information you want. Note: if you use a phrase of
multiple words, and don’t place that phrase in quotation marks, you may get links by virtue
of matching all the words separately – e.g., “Diane” and “Pilarski” separately appeared in
a document that matched the phrase “Diane Pilarski” without quotation marks; but the same
link did not appear when we searched for “Diane Pilarski” with quotation marks. Also,
you may find if the phrase you enter is someone’s name, that many people have the same
name.
• Another strategy: Some Web sites (including some that offer search engines) have “Web
directories” or “indices” – classifications of Web pages. A good example: The Yahoo!
site at http://www.yahoo.com has such a Web directory. You can work your way through
the directory, often, to find desired information.
WEB BROWSER
A web browser (commonly referred to as a browser) is a software application for accessing
information on the World Wide Web. Each individual web page, image, and video is identified
by a distinct Uniform Resource Locator (URL), enabling browsers to retrieve these resources
from a web server and display them on a user's device.
A web browser is not the same thing as a search engine, though the two are often confused. For
a user, a search engine is just a website, such as google.com, that stores searchable data about
other websites. But to connect to a website's server and display its web pages, a user must have
a web browser installed on their device.
As of March 2019, more than 4.3 billion people use a browser, which is about 55% of the
world’s population.
The most popular browsers are Chrome, Firefox, Safari, Internet Explorer, and Edge.
History
The first web browser, called WorldWideWeb, was created in 1990 by Sir Tim Berners-Lee.
He then recruited Nicola Pellow to write the Line Mode Browser, which displayed web pages
on dumb terminals; it was released in 1991.
Nicola Pellow and Tim Berners-Lee in their office at CERN.
Marc Andreessen, lead developer of Mosaic and Navigator
1993 was a landmark year with the release of Mosaic, credited as "the world's first popular
browser". Its innovative graphical interface made the World Wide Web system easy to use and
thus more accessible to the average person. This, in turn, sparked the Internet boom of the
1990s when the Web grew at a very rapid rate. Marc Andreessen, the leader of the Mosaic
team, soon started his own company, Netscape, which released the Mosaic-influenced
Netscape Navigator in 1994. Navigator quickly became the most popular browser.
Microsoft debuted Internet Explorer in 1995, leading to a browser war with Netscape.
Microsoft was able to gain a dominant position for two reasons: it bundled Internet Explorer
with its popular Microsoft Windows operating system and did so as freeware with no
restrictions on usage. Eventually the market share of Internet Explorer peaked at over 95% in
2002.
In 1998, desperate to remain competitive, Netscape launched what would become the Mozilla
Foundation to create a new browser using the open source software model. This work evolved
into Firefox, first released by Mozilla in 2004. Firefox reached a 28% market share in 2011.
Apple released its Safari browser in 2003. It remains the dominant browser on Apple platforms,
though it never became a factor elsewhere.
The last major entrant to the browser market was Google. Its Chrome browser, which debuted
in 2008, has been a huge success. It steadily took market share from Internet Explorer and
became the most popular browser in 2012. Chrome has remained dominant ever since.
In terms of technology, browsers have greatly expanded their HTML, CSS, JavaScript, and
multimedia capabilities since the 1990s. One reason has been to enable more sophisticated
websites, such as web applications. Another factor is the significant increase of broadband
connectivity, which enables people to access data-intensive web content, such as YouTube
streaming, that was not possible during the era of dial-up modems.
Function
The purpose of a web browser is to fetch information resources from the Web and display them
on a user's device.
This process begins when the user inputs a URL, such as https://en.wikipedia.org/, into the
browser. Virtually all URLs on the Web start with either http: or https: which means the
browser will retrieve them with the Hypertext Transfer Protocol. In the case of https:, the
communication between the browser and the web server is encrypted for the purposes of
security and privacy. Another URL prefix is file: which is used to display local files already
stored on the user's device.
Once a web page has been retrieved, the browser's rendering engine displays it on the user's
device. This includes image and video formats supported by the browser.
Web pages usually contain hyperlinks to other pages and resources. Each link contains a URL,
and when it is clicked, the browser navigates to the new resource. Thus the process of bringing
content to the user begins again.
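The fetching half of this cycle is an ordinary HTTP(S) request, which can be imitated outside a
browser. A minimal sketch with Python's standard library, using the Wikipedia URL mentioned
above (rendering, of course, is far more involved and is not shown):

# Fetch a resource the way a browser does before handing it to the rendering engine.
import urllib.request

url = "https://en.wikipedia.org/"          # https: means the transfer is encrypted
with urllib.request.urlopen(url) as response:
    status = response.status                          # e.g. 200 when the page was retrieved
    content_type = response.headers.get("Content-Type")
    body = response.read()                            # raw bytes for the rendering engine

print(status, content_type, len(body), "bytes")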
Settings
Web browsers can typically be configured with a built-in menu. Depending on the browser, the
menu may be named Settings, Options, or Preferences.
The menu has different types of settings. For example, users can change their home page and
default search engine. They also can change default web page colors and fonts. Various
network connectivity and privacy settings are also usually available.
Privacy
During the course of browsing, cookies received from various websites are stored by the
browser. Some of them contain login credentials or site preferences. However, others are used
for tracking user behavior over long periods of time, so browsers typically provide settings for
removing cookies when exiting the browser. Finer-grained management of cookies requires a
browser extension.
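Outside of any particular browser, the same store-and-clear behaviour can be sketched with
Python's standard http.cookiejar module; the site below is a placeholder, not one named in this
document.

# Collect cookies sent by a site, inspect them, then discard them
# (roughly what a browser's "clear cookies on exit" setting does).
import http.cookiejar
import urllib.request

jar = http.cookiejar.CookieJar()
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))

opener.open("https://www.example.org/")    # any site; its cookies land in the jar
for cookie in jar:
    print(cookie.domain, cookie.name)      # e.g. a session or tracking cookie

jar.clear()                                # remove the stored cookies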
Features
The most popular browsers have a number of features in common. They allow users to set
bookmarks and browse in a private mode. They also can be customized with extensions, and
some of them provide a sync service.
Most browsers have these user interface features:
Allow the user to open multiple pages at the same time, either in different browser windows or
in different tabs of the same window.
Back and forward buttons to go back to the previous page visited or forward to the next one.
A refresh or reload button to reload the current page.
A stop button to cancel loading the page. (In some browsers, the stop button is merged with
the reload button.)
A home button to return to the user's home page.
An address bar to input the URL of a page and display it.
A search bar to input terms into a search engine. (In some browsers, the search bar is merged
with the address bar.)
There are also niche browsers with distinct features. One example is text-only browsers that
can benefit people with slow Internet connections or those with visual impairments.
Security
Web browsers are popular targets for hackers, who exploit security holes to steal information,
destroy files, and other malicious activity. Browser vendors regularly patch these security
holes, so users are strongly encouraged to keep their browser software updated. Other
protection measures are antivirus software and avoiding known-malicious websites.
EMBnet
The European Molecular Biology network (EMBnet) is an international scientific network and
interest group that aims to enhance bioinformatics services by bringing together bioinformatics
expertise and capacity. As of 2011, EMBnet had 37 nodes spread over 32 countries. The nodes
include bioinformatics related university departments, research institutes and national service
providers.
Operations
The main task of most EMBnet nodes is to provide their national scientific community with
access to bioinformatics databanks, specialised software and sufficient computing resources
and expertise. EMBnet is also working in the fields of bioinformatics training and software
development. Examples of software created by EMBnet members are: EMBOSS, wEMBOSS,
UTOPIA.
EMBnet represents a wide user group and works closely together with the database producers
such as EMBL's European Bioinformatics Institute (EBI), the Swiss Institute of
Bioinformatics (Swiss-Prot), the Munich Information Center for Protein Sequences (MIPS), in
order to provide a uniform coverage of services throughout Europe. EMBnet is registered in
the Netherlands as a public foundation (Stichting).
Since its creation in 1988, EMBnet has evolved from an informal network of individuals in
charge of maintaining biological databases into the only worldwide organization bringing
bioinformatics professionals together to serve the expanding fields
of genetics and molecular biology. Although composed predominantly of academic nodes,
EMBnet gains an important added dimension from its industrial members. The success of
EMBnet is attracting increasing numbers of organizations outside Europe to join.
EMBnet has a tried-and-tested infrastructure to organise training courses, give technical help
and help its members effectively interact and respond to the rapidly changing needs of
biological research in a way no single institute is able to do.
In 2005 the organization created additional types of node to allow more than one member per
country. The new category denomination is "associated node".
Coordination and organization
EMBnet is governed by the Annual General Meetings (AGM), and is coordinated by an
Executive Board (EB) that oversees the activities of three project committees:
Education and Training committee (E&T). Educational support includes a series of courses
organised in the member countries and languages; the committee also works on the
continued development of on-line accessible education materials.
Publicity and Public Relations committee (P&PR). This committee is responsible for
promoting any type of EMBnet activities, for the advertisement of products and services
provided by the EMBnet community, as well as for proposing and developing new strategies
aiming to enhance EMBnet’s visibility, and for maintaining public relations with EMBnet
communities and related networks/societies.
Technical Manager committee (TM). The TM PC provides assistance and practical help to the
participating nodes and their users.
THE NATIONAL CENTER FOR BIOTECHNOLOGY INFORMATION (NCBI)
The National Center for Biotechnology Information (NCBI) is part of the United States
National Library of Medicine (NLM), a branch of the National Institutes of Health (NIH). The
NCBI is located in Bethesda, Maryland and was founded in 1988 through legislation sponsored
by Senator Claude Pepper.
The NCBI houses a series of databases relevant to biotechnology and biomedicine and is an
important resource for bioinformatics tools and services. Major databases include GenBank for
DNA sequences and PubMed, a bibliographic database for the biomedical literature. Other
databases include the NCBI Epigenomics database. All these databases are available online
through the Entrez search engine. NCBI was directed by David Lipman, one of the original
authors of the BLAST sequence alignment program and a widely respected figure in
bioinformatics. He also led an intramural research program, including groups led by Stephen
Altschul (another BLAST co-author), David Landsman, Eugene Koonin, John Wilbur, Teresa
Przytycka, and Zhiyong Lu. David Lipman stood down from his post in May 2017.
GenBank
NCBI has had responsibility for making available the GenBank DNA sequence database since
1992. GenBank coordinates with individual laboratories and other sequence databases such as
those of the European Molecular Biology Laboratory (EMBL) and the DNA Data Bank of
Japan (DDBJ).
Since 1992, NCBI has grown to provide other databases in addition to GenBank. NCBI
provides Gene, Online Mendelian Inheritance in Man, the Molecular Modeling Database (3D
protein structures), dbSNP (a database of single-nucleotide polymorphisms), the Reference
Sequence Collection, a map of the human genome, and a taxonomy browser, and coordinates
with the National Cancer Institute to provide the Cancer Genome Anatomy Project. The NCBI
assigns a unique identifier (taxonomy ID number) to each species of organism.
The NCBI has software tools that are available by WWW browsing or by FTP. For example,
BLAST is a sequence similarity searching program. BLAST can do sequence comparisons
against the GenBank DNA database in less than 15 seconds.
NCBI Bookshelf
The "NCBI Bookshelf is a collection of freely accessible, downloadable, on-line versions of
selected biomedical books. The Bookshelf covers a wide range of topics including molecular
biology, biochemistry, cell biology, genetics, microbiology, disease states from a molecular
and cellular point of view, research methods, and virology. Some of the books are online
versions of previously published books, while others, such as Coffee Break, are written and
edited by NCBI staff. The Bookshelf is a complement to the Entrez PubMed repository of peer-
reviewed publication abstracts in that Bookshelf contents provide established perspectives on
evolving areas of study and a context in which many disparate individual pieces of reported
research can be organized.
Basic Local Alignment Search Tool (BLAST)
BLAST is an algorithm used for calculating sequence similarity between biological sequences
such as nucleotide sequences of DNA and amino acid sequences of proteins. BLAST is a
powerful tool for finding sequences similar to the query sequence within the same organism or
in different organisms. It searches the query sequence against NCBI databases and servers and
posts the results back to the user's browser in the chosen format. Input sequences to BLAST are
mostly in FASTA or GenBank format, while output can be delivered in a variety of formats
such as HTML, XML and plain text. HTML is the default output format for NCBI's web page.
Results from NCBI BLAST are presented as a graphical overview of all the hits found, a table
of sequence identifiers for the hits with scoring-related data, and the alignments between the
sequence of interest and the hits, with the corresponding BLAST scores.
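BLAST searches can also be submitted to NCBI's servers from a script. The sketch below
assumes the third-party Biopython package is installed; the nucleotide sequence is an arbitrary
placeholder, and the result is requested in XML, one of the output formats mentioned above.

# Submit a nucleotide BLAST search to NCBI via Biopython (pip install biopython).
from Bio.Blast import NCBIWWW

query = "ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG"   # placeholder query sequence
result_handle = NCBIWWW.qblast(
    program="blastn",        # nucleotide-nucleotide search
    database="nt",           # NCBI nucleotide database
    sequence=query,
    format_type="XML",       # "HTML" and "Text" are also accepted
)

with open("blast_result.xml", "w") as out:
    out.write(result_handle.read())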
Entrez
The Entrez Global Query Cross-Database Search System is used at NCBI for all the major
databases such as Nucleotide and Protein Sequences, Protein Structures, PubMed, Taxonomy,
Complete Genomes, OMIM, and several others. Entrez is both an indexing and a retrieval
system, drawing data from various sources for biomedical research. NCBI distributed the first
version of Entrez in 1991, composed of nucleotide sequences from PDB and GenBank, protein
sequences from SWISS-PROT, translated GenBank, PIR, PRF and PDB, and associated
abstracts and citations from PubMed. Entrez is specially designed to integrate the data from
several different sources, databases and formats into a uniform information model and retrieval
system which can efficiently retrieve relevant references, sequences and structures.
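Entrez can also be queried programmatically through NCBI's E-utilities, which Biopython
wraps in its Bio.Entrez module. A minimal sketch, again assuming Biopython is installed; the
e-mail address and search term are placeholders.

# Search a database through Entrez and fetch the matching records.
from Bio import Entrez

Entrez.email = "your.name@example.org"     # NCBI asks callers to identify themselves

# Search the nucleotide database; the query term is only an example.
handle = Entrez.esearch(db="nucleotide", term="BRCA1[Gene] AND human[Organism]", retmax=5)
record = Entrez.read(handle)

# Retrieve the matching records in GenBank format.
fetch = Entrez.efetch(db="nucleotide", id=record["IdList"], rettype="gb", retmode="text")
print(fetch.read()[:500])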
Gene
Gene has been implemented at NCBI to characterize and organize the information about genes.
It serves as a major node in the nexus of genomic map, expression, sequence, protein function,
structure and homology data. A unique GeneID is assigned to each gene record that can be
followed through revision cycles. Gene records for known or predicted genes are established
here and are demarcated by map positions or nucleotide sequence. Gene has several advantages
over its predecessor, LocusLink, including better integration with other databases in NCBI,
broader taxonomic scope, and enhanced options for query and retrieval provided by the Entrez
system.
Protein
The Protein database maintains text records for individual protein sequences, derived from
many different resources such as the NCBI Reference Sequence (RefSeq) project, GenBank,
PDB and UniProtKB/Swiss-Prot. Protein records are present in different formats including
FASTA and XML and are linked to other NCBI resources. Protein provides relevant data to
users such as genes, DNA/RNA sequences, biological pathways, expression and variation data,
and literature. It also provides pre-determined sets of similar and identical proteins for each
sequence as computed by BLAST. The Structure database of NCBI contains 3D coordinate
sets for experimentally determined structures in PDB that are imported by NCBI. The
Conserved Domain Database (CDD) contains sequence profiles that characterize highly
conserved domains within protein sequences. It also has records from external resources
like SMART and Pfam. Another related database, the Protein Clusters database, contains sets
of protein sequences clustered according to the maximum alignments between the individual
sequences as calculated by BLAST.
Pubchem database
The PubChem database of NCBI is a public resource for molecules and their activities in
biological assays. PubChem is searchable and accessible via the Entrez information retrieval
system.
FILE TRANSFER PROTOCOL
The File Transfer Protocol (FTP) is a standard network protocol used for the transfer of
computer files between a client and server on a computer network.
FTP is built on a client-server model architecture using separate control and data connections
between the client and the server. FTP users may authenticate themselves with a clear-text
sign-in protocol, normally in the form of a username and password, but can connect
anonymously if the server is configured to allow it. For secure transmission that protects the
username and password, and encrypts the content, FTP is often secured with SSL/TLS (FTPS)
or replaced with SSH File Transfer Protocol (SFTP).
The first FTP client applications were command-line programs developed before operating
systems had graphical user interfaces, and are still shipped with most Windows, Unix, and
Linux operating systems. Many FTP clients and automation utilities have since been developed
for desktops, servers, mobile devices, and hardware, and FTP has been incorporated into
productivity applications, such as HTML editors.
History of FTP servers
The original specification for the File Transfer Protocol was written by Abhay Bhushan and
published as RFC 114 on 16 April 1971. Until 1980, FTP ran on NCP, the predecessor of
TCP/IP. The protocol was later replaced by a TCP/IP version, RFC 765 (June 1980) and RFC
959 (October 1985), the current specification. Several proposed standards amend RFC 959, for
example RFC 1579 (February 1994) enables Firewall-Friendly FTP (passive mode), RFC 2228
(June 1997) proposes security extensions, RFC 2428 (September 1998) adds support for IPv6
and defines a new type of passive mode.
Protocol overview
Communication and data transfer
Illustration of starting a passive connection using port 21
FTP may run in active or passive mode, which determines how the data connection is
established. In both cases, the client creates a TCP control connection from a random, usually
an unprivileged, port N to the FTP server command port 21.
In active mode, the client starts listening for incoming data connections from the server on port
M. It sends the FTP command PORT M to inform the server on which port it is listening. The
server then initiates a data channel to the client from its port 20, the FTP server data port.
In situations where the client is behind a firewall and unable to accept incoming TCP
connections, passive mode may be used. In this mode, the client uses the control connection to
send a PASV command to the server and then receives a server IP address and server port
number from the server, which the client then uses to open a data connection from an arbitrary
client port to the server IP address and server port number received
Both modes were updated in September 1998 to support IPv6. Further changes were introduced
to the passive mode at that time, updating it to extended passive mode.
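In practice, most FTP client libraries default to passive mode so that the client, not the server,
opens the data connection. A minimal sketch with Python's standard ftplib module; the host
name is a placeholder, and the login is anonymous (anonymous access is described further
below).

# Open the control connection to port 21, log in anonymously, and list a directory
# over a passive-mode data connection.
from ftplib import FTP

with FTP("ftp.example.org") as ftp:     # control connection to the server's port 21
    ftp.login()                         # anonymous login (USER anonymous)
    ftp.set_pasv(True)                  # passive mode: client opens the data connection
    print(ftp.getwelcome())             # server greeting, e.g. a "220 ..." status line
    ftp.retrlines("LIST")               # directory listing arrives over a data connection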
The server responds over the control connection with three-digit status codes in ASCII with an
optional text message. For example, "200" (or "200 OK") means that the last command was
successful. The numbers represent the code for the response and the optional text represents a
human-readable explanation or request. An ongoing
transfer of file data over the data connection can be aborted using an interrupt message sent
over the control connection.
While transferring data over the network, four data representations can be used:
ASCII mode: Used for text. Data is converted, if needed, from the sending host's character
representation to "8-bit ASCII" before transmission, and (again, if necessary) to the receiving
host's character representation. As a consequence, this mode is inappropriate for files that
contain data other than plain text.
Image mode (commonly called Binary mode): The sending machine sends each file byte by
byte, and the recipient stores the bytestream as it receives it. (Image mode support has been
recommended for all implementations of FTP).
EBCDIC mode: Used for plain text between hosts using the EBCDIC character set.
Local mode: Allows two computers with identical setups to send data in a proprietary format
without the need to convert it to ASCII.
For text files, different format control and record structure options are provided. These features
were designed to facilitate files containing Telnet or ASA.
Data transfer can be done in any of three modes:
Stream mode: Data is sent as a continuous stream, relieving FTP from doing any processing.
Rather, all processing is left up to TCP. No End-of-file indicator is needed, unless the data is
divided into records.
Block mode: FTP breaks the data into several blocks (block header, byte count, and data field)
and then passes it on to TCP.
Compressed mode: Data is compressed using a simple algorithm (usually run-length encoding).
Some FTP software also implements a DEFLATE-based compressed mode, sometimes called
"Mode Z" after the command that enables it. This mode was described in an Internet Draft, but
not standardized.
Login
FTP login uses normal username and password scheme for granting access. The username is
sent to the server using the USER command, and the password is sent using the PASS
command. This sequence is unencrypted "on the wire", so may be vulnerable to a network
sniffing attack. If the information provided by the client is accepted by the server, the server
will send a greeting to the client and the session will commence. If the server supports it, users
may log in without providing login credentials, but the same server may authorize only limited
access for such sessions.
Anonymous FTP
A host that provides an FTP service may provide anonymous FTP access. Users typically log
into the service with an 'anonymous' (lower-case and case-sensitive in some FTP servers)
account when prompted for user name. Although users are commonly asked to send their email
address instead of a password, no verification is actually performed on the supplied data. Many
FTP hosts whose purpose is to provide software updates will allow anonymous logins.
NAT and firewall traversal
FTP normally transfers data by having the server connect back to the client, after the PORT
command is sent by the client. This is problematic for both NATs and firewalls, which do not
allow connections from the Internet towards internal hosts. For NATs, an additional
complication is that the representation of the IP addresses and port number in the PORT
command refer to the internal host's IP address and port, rather than the public IP address and
port of the NAT.
There are two approaches to solve this problem. One is that the FTP client and FTP server use
the PASV command, which causes the data connection to be established from the FTP client
to the server. This is widely used by modern FTP clients. Another approach is for the NAT to
alter the values of the PORT command, using an application-level gateway for this purpose.
Differences from HTTP
HTTP essentially fixes the bugs in FTP that made it inconvenient to use for many small
ephemeral transfers as are typical in web pages.
FTP has a stateful control connection which maintains a current working directory and other
flags, and each transfer requires a secondary connection through which the data are transferred.
In "passive" mode this secondary connection is from client to server, whereas in the default
"active" mode this connection is from server to client. This apparent role reversal when in
active mode, and random port numbers for all transfers, is why firewalls and NAT gateways
have such a hard time with FTP. HTTP is stateless and multiplexes control and data over a
single connection from client to server on well-known port numbers, which trivially passes
through NAT gateways and is simple for firewalls to manage.
Setting up an FTP control connection is quite slow due to the round-trip delays of sending all
of the required commands and awaiting responses, so it is customary to bring up a control
connection and hold it open for multiple file transfers rather than drop and re-establish the
session afresh each time. In contrast, HTTP originally dropped the connection after each
transfer because doing so was so cheap. While HTTP has subsequently gained the ability to
reuse the TCP connection for multiple transfers, the conceptual model is still of independent
requests rather than a session.
When FTP is transferring over the data connection, the control connection is idle. If the transfer
takes too long, the firewall or NAT may decide that the control connection is dead and stop
tracking it, effectively breaking the connection and confusing the download. The single HTTP
connection is only idle between requests and it is normal and expected for such connections to
be dropped after a time-out.
Web browser support
Most common web browsers can retrieve files hosted on FTP servers, although they may not
support protocol extensions such as FTPS. When an FTP URL (rather than an HTTP URL) is
supplied, the accessible contents on the remote server are presented in a manner that is similar
to that used for other web content. A full-featured FTP client can be run within Firefox in the
form of an extension called FireFTP.
Syntax
FTP URL syntax is described in RFC 1738, taking the form:
ftp://[user[:password]@]host[:port]/url-path (the bracketed parts are optional).
For example, the URL ftp://public.ftp-servers.example.com/mydirectory/myfile.txt represents
the file myfile.txt from the directory mydirectory on the server public.ftp-servers.example.com
as an FTP resource. The URL
ftp://user001:secretpassword@private.ftp-servers.example.com/mydirectory/myfile.txt adds a
specification of the username and password that must be used to access this resource.
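The optional parts of this syntax map directly onto fields that Python's standard urllib.parse
module can extract; a sketch using the second example URL above:

# Pull the user, password, host and path out of an FTP URL.
from urllib.parse import urlsplit

url = "ftp://user001:secretpassword@private.ftp-servers.example.com/mydirectory/myfile.txt"
parts = urlsplit(url)

print(parts.username)   # 'user001'
print(parts.password)   # 'secretpassword'
print(parts.hostname)   # 'private.ftp-servers.example.com'
print(parts.path)       # '/mydirectory/myfile.txt'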
More details on specifying a username and password may be found in the browsers'
documentation (e.g., Firefox and Internet Explorer). By default, most web browsers use passive
(PASV) mode, which more easily traverses end-user firewalls.
Some variation has existed in how different browsers treat path resolution in cases where there
is a non-root home directory for a user.
Security
FTP was not designed to be a secure protocol, and has many security weaknesses. In May
1999, the authors of RFC 2577 listed vulnerabilities to the following problems:
Brute force attack
FTP bounce attack
Packet capture
Port stealing (guessing the next open port and usurping a legitimate connection)
Spoofing attack
Username enumeration
DoS or DDoS
FTP does not encrypt its traffic; all transmissions are in clear text, and usernames, passwords,
commands and data can be read by anyone able to perform packet capture (sniffing) on the
network. This problem is common to many of the Internet Protocol specifications (such as
SMTP, Telnet, POP and IMAP) that were designed prior to the creation of encryption
mechanisms such as TLS or SSL.
Common solutions to this problem include:
Using the secure versions of the insecure protocols, e.g., FTPS instead of FTP and TelnetS
instead of Telnet.
Using a different, more secure protocol that can handle the job, e.g. SSH File Transfer Protocol
or Secure Copy Protocol.
Using a secure tunnel such as Secure Shell (SSH) or virtual private network (VPN).
FTP over SSH
FTP over SSH is the practice of tunneling a normal FTP session over a Secure Shell connection.
Because FTP uses multiple TCP connections (unusual for a TCP/IP protocol that is still in use),
it is particularly difficult to tunnel over SSH. With many SSH clients, attempting to set up a
tunnel for the control channel (the initial client-to-server connection on port 21) will protect
only that channel; when data is transferred, the FTP software at either end sets up new TCP
connections (data channels), which therefore have no confidentiality or integrity protection.
Otherwise, it is necessary for the SSH client software to have specific knowledge of the FTP
protocol, to monitor and rewrite FTP control channel messages and autonomously open new
packet forwardings for FTP data channels. Software packages that support this mode include:
Tectia ConnectSecure (Win/Linux/Unix) of SSH Communications Security's software suite
Derivatives
FTPS
Explicit FTPS is an extension to the FTP standard that allows clients to request FTP sessions
to be encrypted. This is done by sending the "AUTH TLS" command. The server has the option
of allowing or denying connections that do not request TLS. This protocol extension is defined
in RFC 4217. Implicit FTPS is an outdated standard for FTP that required the use of an SSL or
TLS connection. It was specified to use different ports than plain FTP.
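Explicit FTPS is available in Python's standard library as ftplib.FTP_TLS, which issues the
AUTH TLS command described above before credentials are sent. A minimal sketch; the host
and credentials are placeholders.

# Explicit FTPS: upgrade the control connection with AUTH TLS, then protect the
# data connections as well.
from ftplib import FTP_TLS

with FTP_TLS("ftps.example.org") as ftps:
    ftps.login("user001", "secretpassword")   # sent only after the TLS handshake
    ftps.prot_p()                             # switch data connections to TLS too
    ftps.retrlines("LIST")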
SSH File Transfer Protocol
The SSH file transfer protocol (chronologically the second of the two protocols abbreviated
SFTP) transfers files and has a similar command set for users, but uses the Secure Shell
protocol (SSH) to transfer files. Unlike FTP, it encrypts both commands and data, preventing
passwords and sensitive information from being transmitted openly over the network. It cannot
interoperate with FTP software.
Trivial File Transfer Protocol
Trivial File Transfer Protocol (TFTP) is a simple, lock-step FTP that allows a client to get a
file from or put a file onto a remote host. One of its primary uses is in the early stages of booting
from a local area network, because TFTP is very simple to implement. TFTP lacks security and
most of the advanced features offered by more robust file transfer protocols such as File
Transfer Protocol. TFTP was first standardized in 1981 and the current specification for the
protocol can be found in RFC 1350.
Simple File Transfer Protocol
Simple File Transfer Protocol (the first protocol abbreviated SFTP), as defined by RFC 913,
was proposed as an (unsecured) file transfer protocol with a level of complexity intermediate
between TFTP and FTP. It was never widely accepted on the Internet, and is now assigned
Historic status by the IETF. It runs through port 115, and often receives the initialism of SFTP.
It has a command set of 11 commands and supports three types of data transmission: ASCII,
binary and continuous. For systems with a word size that is a multiple of 8 bits, the
implementation of binary and continuous is the same. The protocol also supports login with
user ID and password, hierarchical folders and file management (including rename, delete,
upload, download, download with overwrite, and download with append).