Comparative Genomics


Course: PG Pathshala - Biophysics Paper: Bioinformatics

Module: Comparative Genomics

Content Writer: Dr. Imtaiyaz Hassan, Jamia Millia Islamia, New Delhi

Introduction:

Comparative genomics is a branch of genomics in which features such as gene order, DNA sequence and regulatory sequences present in the genomes of diverse organisms are compared. Its aim is to achieve a better understanding of the evolutionary process that leads to the formation of different species, and to determine the functions associated with non-coding regions and uncharacterized genes present in the genomes. In comparative genomics, whole genome sequences are aligned and orthologous sequences are identified as a measure of conservation. These analyses form the basis for inferring genomic information and the process of molecular evolution. Since genomic data is being generated at an exponential rate, the field has become more sophisticated in order to compare this ever-increasing volume of data. Various computational techniques are available for the analysis of next-generation sequencing data; they uncover SNPs (single-nucleotide polymorphisms) as well as insertions and deletions by mapping scattered genomic reads against a pre-annotated genome that serves as a reference.
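As a minimal sketch of the last point, the snippet below reports SNPs, insertions and deletions from two rows of an already computed pairwise alignment. The function name `call_variants` and the toy sequences are hypothetical; real pipelines work on mapped reads with quality scores and coverage, which this illustration ignores.

```python
def call_variants(ref_aln: str, query_aln: str):
    """Report differences between two gapped alignment rows of equal length.

    Gaps are written as '-'. Positions are 1-based coordinates on the
    ungapped reference; an insertion is reported at the reference base
    that precedes it, one entry per inserted base.
    """
    if len(ref_aln) != len(query_aln):
        raise ValueError("aligned rows must have equal length")
    variants = []
    ref_pos = 0
    for r, q in zip(ref_aln.upper(), query_aln.upper()):
        if r != "-":
            ref_pos += 1          # advance only on real reference bases
        if r == q:
            continue              # identical column (or gap against gap)
        if r == "-":
            variants.append(("insertion", ref_pos, "-", q))
        elif q == "-":
            variants.append(("deletion", ref_pos, r, "-"))
        else:
            variants.append(("SNP", ref_pos, r, q))
    return variants


if __name__ == "__main__":
    for variant in call_variants("ACGT-ACGT", "ACCTTAC-T"):
        print(variant)
```

The example alignment contains one substitution, one inserted base and one deleted base, so the function returns three entries.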

Objective: In this module we discuss the following topics in detail:

1. Genome sequencing

2. Publicly available databases for comparative genomics

3. Tools available for comparative genomics

4. Common problems in genome annotation

1. Genome sequencing:

In 1995, the first sequenced genomes, those of Haemophilus influenzae and Mycoplasma genitalium, were published, followed by the sequencing of the first eukaryotic genome, that of Saccharomyces cerevisiae. These findings led to comparative analyses of the genomes, which in turn led to the characterization of around one-third of the uncharacterized genes present in them. Over the past few decades, the number of completely sequenced, annotated and assembled genomes has grown exponentially owing to the introduction of next-generation sequencing techniques. More than 316 prokaryotic, 27 archaeal, 280 eukaryotic and 1600 viral genomes have been sequenced and deposited in publicly available databases. The trend of exponential increase in sequenced genomes is shown in the figure below.

http://previews.figshare.com/1090780/preview_1090780.jpg

2. Publicly available databases for comparative genomics

A variety of publicly available databases provide facilities for the retrieval, manipulation and analysis of sequenced genomes. Several research groups working in the area of genomics maintain databases that provide important information such as operon organization, functional predictions, three-dimensional structure and metabolic reconstructions. Here we discuss some of the important databases.

PEDANT (http://pedant.gsf.de)

It provides a comprehensive automatic platform for the analysis of genomic sequences by a variety of bioinformatics tools, accessed through a Web-based client interface. Around 177 completely sequenced and incomplete genomes have been processed so far, including large, newly published eukaryotic genomes such as those of mouse and human. The PEDANT repository has a variety of novel analytical features, such as integration with the BioRS data retrieval system, which permits rapid text queries; pre-processed sequence clusters for every complete genome; and a complete set of software for genome evaluation, including genome divergence tables and protein function prediction sub-systems based on genomic context. Furthermore, it supports the visualization and in silico analysis of protein-protein interaction data drawn from the experimental literature. With around 650,000 proteins from diverse organisms, PEDANT is a useful platform for the analysis of raw genomic data from recently sequenced genomes. The homepage of PEDANT is shown in the figure below.

COG (Clusters of Orthologous Groups of proteins) database:

A coherent classification of the proteins encoded in sequenced genomes is essential for making genome sequences fully usable for evolutionary and functional analysis. The COGs repository (http://www.ncbi.nlm.nih.gov/COG) is an attempt at a phylogenetic classification of the proteins encoded in 21 completely sequenced genomes of archaea, bacteria and eukaryotes. The database was created by applying the principle of consistency of genome-specific best hits to the outcome of an exhaustive comparison of all protein sequences from these genomes. The resource contains 2091 COGs that comprise 56-83% of the gene products from each complete archaeal and bacterial genome, as well as gene products from the genome of the yeast Saccharomyces cerevisiae. The functions of new proteins are predicted by the COGNITOR program, which fits them into the appropriate COGs; sequences can be submitted through the web page shown below.
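The best-hit principle can be illustrated with a simplified two-genome version, reciprocal best hits. This is only a sketch: the COG procedure itself checks consistency of best hits across several genomes, and the function names, protein identifiers and scores below are made up for illustration.

```python
def best_hits(scores):
    """Map each query protein to its highest-scoring subject.

    `scores` maps (query, subject) pairs to a similarity score,
    for example a BLAST bit score.
    """
    best = {}
    for (query, subject), score in scores.items():
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Pairs (a, b) where a's best hit is b and b's best hit is a."""
    best_ab = best_hits(a_vs_b)
    best_ba = best_hits(b_vs_a)
    return sorted((a, b) for a, b in best_ab.items() if best_ba.get(b) == a)


if __name__ == "__main__":
    # Hypothetical scores for proteins of genome A searched against B, and back.
    a_vs_b = {("a1", "b1"): 250, ("a1", "b2"): 90,
              ("a2", "b2"): 180, ("a3", "b2"): 200}
    b_vs_a = {("b1", "a1"): 245, ("b2", "a3"): 205, ("b2", "a2"): 175}
    print(reciprocal_best_hits(a_vs_b, b_vs_a))
```

In this toy data a2 also hits b2, but b2 prefers a3, so only the pairs (a1, b1) and (a3, b2) are reported as putative orthologs.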

KEGG (Kyoto Encyclopedia of Genes and Genomes):

KEGG is a publicly available resource for understanding the high-level functions and utilities of biological systems from molecular-level knowledge, particularly the large molecular datasets produced by genome sequencing projects and other high-throughput technologies. This set of databases contains information on genomes, diseases, drugs, biological pathways and chemical substances. The system is used in a variety of bioinformatics research areas such as metagenomics, genomics, metabolomics and systems biology. A typical KEGG map of a metabolic pathway is given below.

http://ecoliwiki.net/colipedia/images/thumb/c/c9/KEGG_pentose.jpg/400px-KEGG_pentose.jpg

GOLD (Genomes OnLine Database):

It is a wide-ranging online database (http://www.genomesonline.org) devoted to indexing and monitoring genome sequencing research worldwide. It provides the current status of completed as well as ongoing sequencing projects, including a broad collection of curated metadata. The current version, 5, provides an interface to about 19,200 studies, 56,000 sequencing projects, 56,000 biosamples and 39,400 analysis projects. Typical tools available on the GOLD web page are shown in the figure below.

Microbial Genome Database (MBGD): MBGD, maintained by the University of Tokyo, Japan, is a convenient tool for searching for likely homologs among all sequenced microbial genomes. In contrast to COGs, MBGD assigns homology relationships based solely on sequence similarity (BLASTP E-values of 10^-2 or less).
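A similarity cutoff of this kind can be sketched as a filter over BLAST tabular output (the standard 12-column layout, in which the eleventh column is the E-value). The function name `filter_hits` and the example rows are hypothetical, and the snippet is not meant to reproduce MBGD's own pipeline.

```python
def filter_hits(tabular_text, max_evalue=1e-2):
    """Keep (query, subject, evalue) for hits with E-value <= max_evalue.

    Expects tab-separated lines in the 12-column BLAST tabular layout:
    qseqid sseqid pident length mismatch gapopen qstart qend sstart send
    evalue bitscore. Blank lines and '#' comment lines are skipped.
    """
    hits = []
    for line in tabular_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) < 12:
            raise ValueError(f"expected 12 columns, got {len(fields)}")
        evalue = float(fields[10])
        if evalue <= max_evalue:
            hits.append((fields[0], fields[1], evalue))
    return hits


# Made-up example rows, for illustration only.
EXAMPLE = "\n".join([
    "\t".join(["geneA", "geneX", "98.5", "200", "3", "0",
               "1", "200", "1", "200", "3e-50", "380"]),
    "\t".join(["geneA", "geneY", "31.0", "80", "50", "4",
               "10", "90", "5", "84", "0.5", "28.1"]),
    "\t".join(["geneB", "geneZ", "40.2", "150", "85", "5",
               "1", "150", "3", "152", "1e-2", "45.0"]),
])

if __name__ == "__main__":
    for hit in filter_hits(EXAMPLE):
        print(hit)
```

With the default threshold, the weak geneA/geneY hit (E-value 0.5) is discarded and the other two are kept.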

Other genome databases: A list of other genome databases is provided here (Source: https://www2.informatik.hu-berlin.de/~hakenber/links/modelorg.html)

Saccharomyces Genome Database: SGD(TM) is a scientific database of the molecular biology and genetics of the yeast Saccharomyces cerevisiae.

FlyBase: A database of the Drosophila genome

GDB - Genome Database: An international collaboration in support of the Human Genome Project

HUGO - Human chromosomes and mitochondrion

MGI - Mouse Genome Informatics: Integrated access to data on the genetics, genomics and biology of the laboratory mouse.

TAIR - The Arabidopsis Information Resource: Provides a comprehensive resource for the scientific community working with Arabidopsis thaliana

RGD - Rat Genome Database: Rattus norvegicus. Curates and integrates rat genetic and genomic data.

ZFIN - Zebrafish Information Network: Danio rerio.

WormBase: Information concerning the genetics, genomics and biology of Caenorhabditis elegans and some related nematodes

ACeDB: Caenorhabditis elegans genetics and genomics

TubercuList: Collates and integrates various aspects of the genomic information from M. tuberculosis, as well as M. africanum, M. bovis, M. bovis BCG, M. canetti and M. microti

VirGen: A comprehensive viral genome resource

coliBase: An online database for E. coli, Salmonella and Shigella comparative genomics

E. coli Genome Project: (strains K12 MG1655, O157:H7 EDL933 and UPEC CFT073)

GenoBase: E. coli strain K-12 (W3110). List of E. coli databases and resources

MolliGen: A database dedicated to the comparative genomics of Mollicutes: Mycoplasma genitalium, Mycoplasma pneumoniae, Ureaplasma urealyticum-parvum, Mycoplasma pulmonis, Mycoplasma penetrans and Mycoplasma gallisepticum

CryptoDB: Provides access to the genome data for the apicomplexan parasite C. parvum

dictyBase: A new Dictyostelium discoideum genome database

3. Tools available for comparative genomics

BLAST 2:

This software performs an alignment between two genomes. It is available at NCBI for public use. It takes genomic sequences in FASTA format and produces its output as a graphical view.
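For readers unfamiliar with the FASTA format mentioned here, the following sketch reads FASTA text into a dictionary of sequences. The function name `parse_fasta` and the record names are hypothetical; the snippet is independent of BLAST itself and assumes each record name is unique.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into {record_name: sequence}.

    A record starts with a '>' header line; the record name is the first
    whitespace-delimited word of the header. Sequence lines that follow
    are concatenated until the next header.
    """
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            words = line[1:].split()
            if not words:
                raise ValueError("FASTA header without a record name")
            name = words[0]
            records[name] = []
        elif name is None:
            raise ValueError("sequence data found before the first header")
        else:
            records[name].append(line)
    return {name: "".join(parts) for name, parts in records.items()}


if __name__ == "__main__":
    example = ">genomeA demo record\nACGT\nAC\n>genomeB\nGGTT\n"
    print(parse_fasta(example))
```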

MUMmer

This is a freely accessible tool (http://mummer.sourceforge.net/) for genomic alignment, developed by Steven Salzberg's group at TIGR. MUMmer performs a base-to-base alignment and produces highlighted output representing exact matches as well as differences between the aligned genomes; it locates SNPs, significant repeats, large inserts and tandem repeats. The outline of the MUMmer methodology is presented in the given flow chart.

The process of genomic alignment is presented in the accompanying figure.
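The idea of anchoring an alignment on matches that are exact and unique in both genomes can be sketched with fixed-length k-mers. This is a deliberate simplification under stated assumptions: MUMmer finds maximal unique matches of variable length with suffix-based index structures, whereas the hypothetical functions below use a plain dictionary and a fixed k.

```python
from collections import Counter


def unique_kmer_positions(seq, k):
    """Map each k-mer that occurs exactly once in `seq` to its start index."""
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    counts = Counter(kmers)
    return {kmer: i for i, kmer in enumerate(kmers) if counts[kmer] == 1}


def unique_anchors(ref, query, k):
    """Exact k-mer matches that are unique in both sequences.

    Returns a sorted list of (ref_start, query_start, kmer) tuples.
    """
    ref_pos = unique_kmer_positions(ref, k)
    query_pos = unique_kmer_positions(query, k)
    shared = ref_pos.keys() & query_pos.keys()
    return sorted((ref_pos[kmer], query_pos[kmer], kmer) for kmer in shared)


if __name__ == "__main__":
    # "ACGT" occurs twice in the reference, so it cannot serve as an anchor.
    for anchor in unique_anchors("ACGTACGTTG", "TTACGTTG", k=4):
        print(anchor)
```

In the toy example all three anchors lie on the same diagonal (reference start minus query start equals 2), which is the kind of collinear chain an aligner would then extend and fill in.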

Comparative Genome Analysis Tool (CGAT). CGAT was developed for detailed comparison of closely related bacterial-sized genomes. It offers an option to visualize pre-computed pairwise genome alignments. Users can overlay several kinds of information on the alignment, such as the presence of tandem repeats or interspersed repetitive sequences and changes in codon usage bias, to facilitate interpretation of the observed genomic changes. Besides visualization, CGAT also provides a general framework for processing genome-scale alignments using various existing alignment programs. CGAT is available at http://mbgd.genome.ad.jp/CGAT/.
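To make the notion of codon usage concrete, the sketch below computes raw codon frequencies for a coding sequence. It is an illustration only and not CGAT's method: the function name `codon_usage` is hypothetical, and a proper bias measure would compare synonymous codons for each amino acid rather than raw frequencies.

```python
from collections import Counter


def codon_usage(cds):
    """Return {codon: relative frequency} for an in-frame coding sequence.

    The sequence is read in frame from its first base; trailing bases
    that do not form a complete codon are ignored.
    """
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3
    counts = Counter(cds[i:i + 3] for i in range(0, usable, 3))
    total = sum(counts.values())
    if total == 0:
        return {}
    return {codon: n / total for codon, n in counts.items()}


if __name__ == "__main__":
    # Toy sequence: ATG GCT GCT GCC TAA
    for codon, freq in sorted(codon_usage("ATGGCTGCTGCCTAA").items()):
        print(codon, round(freq, 2))
```

Comparing such frequency tables between two regions, or between a region and the genome-wide average, is one simple way to spot the shifts in codon usage that the text mentions.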

SyntTax - is a web server linking synteny to prokaryotic taxonomy. SyntTax incorporates a full hierarchical taxonomic tree, allowing intuitive access to all completely sequenced prokaryotes (Archaea and Bacteria). (Reference: Oberto J. 2013. BMC Bioinformatics. 14:4).

AutoGRAPH: is an integrated web server for multi-species comparative genomic analysis. It is designed for constructing and visualizing synteny maps between two or three species, for determining and displaying macrosynteny and microsynteny relationships among species, and for highlighting evolutionary breakpoints. (Reference: Derrien T et al. 2007. Bioinformatics 23:498-499).

Source: J Clin Invest. 2003 Apr 15; 111(8): 1099-1106
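The synteny and breakpoint concepts used by SyntTax and AutoGRAPH can be sketched as follows: given the gene order of two genomes and an ortholog mapping, collect runs of genes whose orthologs are also adjacent in the second genome. The function name, gene identifiers and the `min_genes` parameter are hypothetical, and neither server is claimed to work this way internally.

```python
def synteny_blocks(order_a, order_b, orthologs, min_genes=2):
    """Group genes of genome A into blocks that are collinear in genome B.

    order_a, order_b: gene identifiers in chromosomal order.
    orthologs: maps a gene of A to its ortholog in B.
    A block is a run of A genes whose orthologs sit at consecutive
    positions in B, all ascending or all descending (an inversion).
    Genes of A without an ortholog are skipped and do not break a block.
    """
    index_b = {gene: i for i, gene in enumerate(order_b)}
    mapped = [(gene, index_b[orthologs[gene]])
              for gene in order_a
              if gene in orthologs and orthologs[gene] in index_b]

    blocks, current, step = [], [], 0
    for gene, pos in mapped:
        if current:
            delta = pos - current[-1][1]
            if delta in (1, -1) and step in (0, delta):
                step = delta
                current.append((gene, pos))
                continue
            # Collinearity is broken here: an evolutionary breakpoint.
            if len(current) >= min_genes:
                blocks.append([g for g, _ in current])
        current, step = [(gene, pos)], 0
    if len(current) >= min_genes:
        blocks.append([g for g, _ in current])
    return blocks


if __name__ == "__main__":
    order_a = ["a1", "a2", "a3", "a4", "a5", "a6"]
    order_b = ["b1", "b2", "b3", "b6", "b5", "b4"]  # last three genes inverted
    orthologs = {f"a{i}": f"b{i}" for i in range(1, 7)}
    print(synteny_blocks(order_a, order_b, orthologs))
```

In the toy data the first three genes keep their order and the last three are inverted in genome B, so two blocks are returned with a breakpoint between a3 and a4.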

BASys Bacterial Annotation Tool - This tool supports automated, in-depth annotation of bacterial genomic sequences. It accepts raw DNA sequence data and an optional list of gene identification information, and provides extensive textual annotation and hyperlinked image output. (Reference: G.H. Van Domselaar et al. 2005. Nucl. Acids Res. 33(Web Server issue):W455-W459).

MAKER Web Annotation Service (MWAS): is an easily configurable, web-accessible genome annotation pipeline. Its purpose is to allow research groups with small to intermediate amounts of eukaryotic and prokaryotic genome sequence to independently annotate and analyse their data and produce output that can be loaded into a genome database. (Reference: Holt, C. & Yandell, M. 2011. BMC Bioinformatics 12:491).

FLAN (FLu ANnotation): is an NCBI web server for genome annotation of influenza virus. It is a tool for user-provided influenza A virus or influenza B virus sequences, and it can validate and predict the protein sequences encoded by an input flu sequence. (Reference: Y. Bao et al. 2007. Nucleic Acids Res. 35(Web Server issue):W280-W284.)

4. Common problems in genome annotation

The genome is a very complex component of a biological system; hence genome annotation defies full automation and is inherently error-prone. The development of semi-automated annotation systems and appropriate training of annotators can help to reduce the accidental error rate during genome annotation. There are, however, several sources of systematic error that plague genome analysis.

Incomplete Information in Databases: Every database search has the potential to amplify noise. The original annotation may involve only a minor inaccuracy or incompleteness, but its transfer on the basis of sequence similarity aggravates the problem and eventually results in outright false functional assignments.

False Positives and False Negatives in Database Searches: The distribution of similarity scores for evolutionarily and functionally relevant sequence alignments is very broad, and a considerable fraction of them fail the E-value cutoff, resulting in undetected relationships and missed opportunities for functional prediction (false negatives).

Organismal Context as a Source of Errors: Genomic information about an organism may serve as a source of important functional information. However, this information may be misinterpreted and become one of the major sources of error and confusion in genome annotation.

Summary:

In this module you have studied the process of comparative genomics. The sequencing of genomes that began in the 1990s enabled their comparison, which in turn allows us to find the functions of various uncharacterized genes. With the revolution in sequencing techniques, the number of sequenced genomes has increased exponentially. Databases and tools now organize genomic data consistently according to phylogenetic, functional or structural principles, and explicitly take advantage of the diversity of genomes to increase the resolving power and robustness of the analyses. This has made comparative genomic studies more sophisticated. A variety of publicly available databases, such as KEGG, GOLD, COGs and many more, make the assessment and analysis of sequenced genomes possible. Furthermore, numerous bioinformatics tools have been developed that make alignment between large genomes possible and enable the identification of SNPs and conserved regions in partially sequenced genomes. The ultimate success of comparative genome analysis and annotation critically depends on complex decisions based on a variety of inputs, including the unique biology of each organism.

End of Module 7. Thank you.
