AGCoL

  System Guide

This document discusses building a SyMAP v5 database. It always applies to the latest release.
Contents: 1. Introduction, 2. Requirements, 3. Install SyMAP, 4. Demo, 5. New projects and synteny, 6. General. See also Referenced external docs.
For transcriptome analysis and comparative transcriptomes, see TCW.

1. Introduction


Overview

SyMAP is a system for computing, displaying, and analyzing syntenic alignments between divergent eukaryotic genomes. It is designed for the comparison of a few genomes at a time (i.e. 2-4) where synteny is computed between each pair.
Its features include the following:

Compute

  • Find synteny between two sequenced eukaryotic genomes with optional annotation.

  • Order a draft genome against a fully sequenced genome (not draft-to-draft).

  • Self-synteny.

Query and view

  • For multiple selected synteny pairs, display using dot plot, circular, and side-by-side.

  • Query annotation, collinear genes, multi-hit genes, etc.

[Figures: viewPlants; dotplot-3-genome; circle-2-genome; 2d-3chr]

Publications

SyMAP is freely distributed software. However, if you use SyMAP results in published research, you must cite one or both of the following articles, along with the external program MUMmer [1,2].
        C. Soderlund, M. Bomhoff, and W. Nelson (2011)
        SyMAP: A turnkey synteny system with application to plant genomes.
        Nucleic Acids Research 39(10):e68.

        C. Soderlund,  W. Nelson, A. Shoemaker and A. Paterson (2006)
        SyMAP: A System for Discovering and Viewing Syntenic Regions of FPC maps
        Genome Research 16:1159-1168.
The back-end processing of SyMAP runs MUMmer [1,2] for the alignments (included in the tarball) and computes the synteny blocks from the alignment results. The SyMAP synteny algorithm is described in the two publications above, though there have been many unpublished updates since.

Steps for finding synteny

The following three scripts are provided in the tar file.
 ./xToSymap    Format files from NCBI and Ensembl into a SyMAP-friendly format.
 ./symap       Build the SyMAP synteny database; view and query.
 ./viewSymap   View and query the database results.

Follow the steps below to get started with SyMAP.

1. Use Linux or MacOS: see System requirements.
2. Requirements: set up Perl, Java and MySQL.
3. Install SyMAP: it is a simple unzip; see Installation and SyMAP MySQL parameters.
4. Run the demo: highly recommended; see Running the demo.
5. Prepare input files: FASTA sequence and optional GFF annotation; see Input.
6. Load the files into SyMAP: for a project, set the project parameters, then select Load project. See New project and Load project.
7. Compute alignments and synteny: for a selected pair, set the pair parameters, then select Selected pair. See New project and Align&Synteny.
8. View results: see the User Guide for details on viewing and querying the results.

2. Requirements


System Requirements

Basic knowledge: This documentation requires a basic knowledge of Linux. The documentation and the SyMAP interface assume familiarity with the Linux directory (folder) structure, as used from a terminal application.

The machine must be a Linux or MacOS 64-bit machine with sufficient memory for your dataset.

The largest component of SyMAP execution time is running MUMmer [1,2].

  • The time and memory for MUMmer depend on the genome sizes, complexity and similarity.

  • If MUMmer fails, it is often due to insufficient memory; see the MUMmer webpage, which explains how to determine the problem and ways around it.

  • For large genomes, it is essential that the machine has at least 6 GB of RAM and 6 GB of disk space for each CPU used.

  • If gene discovery is not important, then masking all but the gene sequences greatly reduces the time and memory. See Masked below.

  • See Disk space for an idea of needed disk space. See Tested datasets and timings to get an idea of compute times.

If SyMAP runs out of memory, see Trouble Shoot.

For viewing alignments with viewSymap, CPU and memory needs are typically negligible, unless you are performing queries on more than 4-5 genomes at once.

Perl, Java and MySQL

Perl: This is for MUMmer; see the MUMmer manual, section on Software Requirements. It states that the following are required: Perl5 (5.6.0), sh, sed, awk (the last three are standard on any Linux-based machine).

Java: You must have Java version 17.0.11 or later. The released symap.jar file has been compiled with Java 17.0.11, which is upward compatible.

MySQL: If your machine does not have MySQL or MariaDB, download and install it. For example, MySQL can be downloaded from dev.mysql.com. On a personal MacOS machine, simply download the '.dmg' file and follow the instructions. On a work server, the system administrator may need to install it.

Important Note: The default settings of MySQL are poorly suited for large-scale data storage. You will want to adjust the parameters innodb_buffer_pool_size and innodb_flush_log_at_trx_commit as described in Trouble Shoot MySQL.

Disk space

When MUMmer is running, the temporary files can take up to 6 GB of disk space per CPU/thread; hence, if you have 12 CPUs/threads running, this could use 72 GB of disk space.
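
The arithmetic above can be sketched as a small shell helper. This is only a rough budgeting aid, assuming the guide's rule of thumb of about 6 GB of RAM and 6 GB of temporary disk space per CPU; the function names are made up for this example and are not part of SyMAP.

```shell
# Rule of thumb from this guide (an upper estimate, not a measurement).
GB_PER_CPU=6

# Print the GB of temporary disk space (and RAM) to budget for N CPUs.
mummer_budget_gb() {
    echo $(( $1 * GB_PER_CPU ))
}

# Print the largest CPU count that fits in the given GB of RAM.
# This ignores memory used by the operating system and other programs.
max_cpus_for_ram() {
    echo $(( $1 / GB_PER_CPU ))
}

mummer_budget_gb 12    # prints 72
max_cpus_for_ram 48    # prints 8
```

The second number is an upper bound only; very large chromosomes can need far more than 6 GB each, so fewer CPUs may be the safer choice.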

MUMmer produces a .mum and .delta file; SyMAP only uses the .mum file so it removes the MUMmer alignment .delta file. If you do not want SyMAP to remove the .delta, use the "-mum" command line argument.

The following are the sizes of the MUMmer results in the data/seq_results/<species1_to_species2>/align directory:

Species                                    Genome sizes     /align size
Arabidopsis thaliana x Brassica oleracea   119Mb x 447Mb    23M
Homo sapiens x Mus musculus                3Gb x 2.7Gb      573M
Homo sapiens x Pan troglodytes             3Gb x 3Gb        12G*
*These closely related species result in many hits!

You can remove the data/seq_results/<project1_to_project2>/ directory after SyMAP has finished the synteny computation, but it is strongly recommended that you leave them if you have the space. There are frequent SyMAP updates with improvements to the clustering and synteny computations; if you have kept these files, then you can update your database in very little time, e.g. Hsa x Mus took >30h for MUMmer versus 1m:37s for the synteny. Also, you can easily try different parameters for the clustering and synteny stages.

The alignments can be compressed while not in use. The following was executed from the data/seq_results/Hsa_to_Pan directory, where the /align files took 12G of disk space:


tar -czf align.tar.gz align
rm -rf align				# this removes the directory
The resulting align.tar.gz file takes 3.5G of space.
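
The reverse step is a plain tar extraction. The sketch below wraps both directions in two helper functions (the names pack_align and unpack_align are invented for this example); it assumes you run them from the data/seq_results/<project1_to_project2> directory and that the alignments need to be unpacked again before SyMAP reuses them.

```shell
# Pack the align directory and remove the original only if tar succeeded.
pack_align() {
    tar -czf align.tar.gz align && rm -rf align
}

# Restore the align directory from the archive (the archive is kept).
unpack_align() {
    tar -xzf align.tar.gz
}
```

Chaining with && means a failed or interrupted tar never leads to the alignments being deleted.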

Tested platform

SyMAP has been tested on the following:
Machine                            MySQL              Java                      Core (CPU)         Memory   Purchased
v5.6.6 and later:
  1. MacOS M4                      MySQL v8.0.42      24 from Oracle            M4 12-Core         48Gb     2025
v5.4.1 and later:
  2. MacOS x86_64                  MySQL v8.0.33,     8, 15, 17, 18, 20 from    3.2 GHz 6-Core     64Gb     2018
                                   MySQL v8.4,        Adoptium and Oracle
                                   MariaDB 11.0.2
  3. Linode (Ubuntu 22.04.2 LTS)   MySQL 8.0.33       17                        Nanode             1Gb      2023
v5.4.0 and earlier:
  4. Linux amd64 (Centos)          MariaDB v10.4.12   1.8                       2.3 GHz 24-Core    128Gb    2011

Tested datasets and timings

Datasets

The following datasets have timings reported for them.

Mammalia
  Homo sapiens (Hsa)             24 chrs, 3Gb
  Pan troglodytes (Pan)          25 chrs, 3Gb
  Mus musculus (Mus)             21 chrs, 2.7Gb

Rosales
  Prunus persica (Peach)         8 chr, 227Mb
  Prunus yedoensis (Pyedo)       250 scaffolds, 408Mb

Brassicales
  Arabidopsis thaliana (A.thal)  5 chr, 119Mb
  Brassica rapa (B.rapa)         10 chr, 297Mb
  Brassica oleracea (B.oler)     9 chr, 447Mb

Poaceae
  Oryza sativa (Rice)            12 chr, 373Mb
  Zea mays (Maize)               10 chr, 2.1Gb
  Triticum aestivum (Wheat)      22 chrs, 14.5Gb

Timings

Most of the timings were on MacOS M4.

Times: These are from the SyMAP symap.log file, which uses the Java system time functions (clock times are greater than the Java CPU system times). Times can vary over multiple executions; timings for only one execution are shown.
CPU: Unless stated otherwise, the following number of CPUs was used: (1) 4 CPUs with amino acid alignment; (2) 6 CPUs with self-synteny nucleotide alignment (NT). As mentioned earlier, it is important that the machine has at least 6 GB of RAM and disk space for each CPU used; see CPU.
Memory: SyMAP inputs chr x chr into MUMmer, so the largest chromosome size has the biggest impact on memory. Concat concatenates small chromosomes; in the Notes below, !Concat indicates that this was not performed, so the input to MUMmer was smaller files. Self-synteny always uses chr x chr.
  1. MacOS M4 with 48Gb has been tested with the following:

    Species           MUMmer     Synteny    Size             Notes
                      (hr:min)   (min:sec)
    Hsa x Mus         30h:45m    5m:13s     3Gb x 2.7Gb      504 alignments, !Concat
    Hsa x Mus         13h:00m    0m:49s     3Gb x 2.7Gb      MASKED, 504 alignments, !Concat (footnote 1)
    Hsa x Mus         8h:03m     0m:37s     3Gb x 2.7Gb      MASKED, 84 alignments, v5.8.1 (footnote 2)
    Hsa x Pan         22h:48m    9m:31s     3Gb x 3Gb        MASKED, closely related, 600 alignments, !Concat
    Hsa x Pan         16h:40m    9m:23s     3Gb x 3Gb        84 alignments, v5.8.1
    Hsa x self        7h:45m     8m:28s     3Gb x 3Gb        276 alignments, NT

    Peach x Pyedo     0h:44m     1m:29s     227Mb x 408Mb    draft Pyedo ordered by peach

    A.thal x B.oler   0h:31m     1m:14s     119Mb x 447Mb
    B.rapa x B.oler   1h:25m     4m:12s     297Mb x 447Mb    closely related
    B.rapa x self     0h:16m     1m:17s     297Mb x 297Mb    NT

    Wheat x Rice      6h:11m     3m:32s     14Gb x 373Mb     MASKED, CPU 1, largest chr 801Mb, !Concat (footnote 3)
    Maize x Rice      5h:59m     1m:16s     2.1Gb x 373Mb
    Maize x self      >48h       28m:52s    2.1Gb x 2.1Gb    duplications, NT (footnote 4)
    Maize x self      0h:16m     5m:03s     2.1Gb x 2.1Gb    MASKED, duplications, NT

    Footnotes:

    1. By masking everything but the genes, MUMmer took only 13h instead of 30h:45m and produced equivalent results, albeit without the possibility of gene discovery; i.e. all hits link two genes. To mask the genes, select the masked checkbox on the Mask parameter.

    2. Using v5.8.1 with concatenation, MUMmer took 8h instead of 13h and produced the same results.
      MUMmer would fail on my Mac M4 (48G) with input files of 3Gb (i.e. Hsa).
      For SyMAP v5.8.1, Concat was modified to create concatenated files of size <1Gb. Hence, MUMmer finished and the time for alignments was greatly reduced.
      The v5.8.1 unmasked alignments (not shown in table) took 19h:23m vs 30h:45m.

    3. The largest wheat chromosome is 801Mb; in order for this to run on this relatively small machine:
      (a) Concat was unchecked, (b) both genomes were masked, (c) MUMmer V4 was used, (d) 1 CPU was used.
      When CPU was set to 4, the machine hung because it ran out of application memory due to processing very large chromosomes in parallel.

    4. The time for running MUMmer on Maize x self is approximate; there were 55 alignments that took anywhere between 1 and 5 hours each, and I did not run them all at once. Maize has a lot of duplication; hence the longer time (compared with Hsa x self).

  2. MacOS x86_64 with 64Gb was tested on the following:
    Species           MUMmer    SyMAP
    A.thal x B.oler   0h:33m    4m:06s
    Note the longer time for the synteny computation on the older Mac machine.

  3. Linode nanode with 1Gb was too small to run MUMmer, so the MUMmer demo result files were transferred to the data/seq_results/demo_seq_to_demo_seq2 directory. This allowed all other features to be tested on the demo, including running the synteny algorithm without the alignment. Also, two tiny input files were used to test MUMmer.

  4. Linux amd64 with 128Gb was used extensively on large plant genomes.
    For example, aligning Maize x Rice took a total of 1h:03m using 8 CPUs.
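
The speed-ups mentioned in the footnotes to the MacOS M4 timings can be recomputed from the MUMmer column. The helpers below are a small illustrative sketch (to_minutes and speedup are names invented here); they assume times written in the table's "30h:45m" form.

```shell
# Convert a time such as 30h:45m to minutes.
to_minutes() {
    echo "$1" | LC_ALL=C awk -F'[h:m]+' '{ print $1 * 60 + $2 }'
}

# Print how many times faster the second run was than the first.
speedup() {
    LC_ALL=C awk -v a="$(to_minutes "$1")" -v b="$(to_minutes "$2")" \
        'BEGIN { printf "%.1f\n", a / b }'
}

speedup 30h:45m 13h:00m    # unmasked vs masked Hsa x Mus: prints 2.4
speedup 13h:00m 8h:03m     # masked, before vs after v5.8.1: prints 1.6
```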

3. Install SyMAP


Tarball

Installation consists of unzipping the download tarball using the command
     > tar -xf symap_5.tar.gz
This can be done anywhere and it creates a directory called symap_5. You can move this directory later if desired. The contents are:
   LICENSE   README   data/   ext/   java/
   scripts/  symap.config     symap  viewSymap  xToSymap
Data: The data/ directory contains a seq/ sub-directory, which contains the demo files and is the default location for all input sequence files. symap expects to find the data directory; viewSymap does not.

External executables

The ext/ directory contains the external programs MUMmer [1,2] for sequence alignment, and MAFFT [6] and MUSCLE [7] for interactive multiple sequence alignment (for Queries). The directory contains:
	README		mummer/		mummer4/	muscle/		mafft/
Each has subdirectories:
Subdirectory   OS (Architecture)        Note
lintel64       Linux
mac            Mac OS X (x86_64)
macM4          Mac OS X (M4 silicon)    No muscle executable

SyMAP will determine which subdirectory to use.

If you compile your own executables for a different machine (architecture), do the following:

  1. Under mummer and mafft, make a directory with your machine name.
  2. Put the executables under this directory in the exact same way as shown for lintel64.
  3. In the symap configuration file (default symap.config), add a line
         arch={your directory architecture name}
Note that I do not have an alternative machine to try this on, but it should work. Email me at cas1@arizona.edu if it does not.
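
The three steps above could be scripted roughly as follows. This is an untested-on-real-hardware sketch: add_arch is a hypothetical helper, not something shipped with SyMAP, and copying the executables (step 2) is left as a manual step because their layout has to mirror what is under lintel64.

```shell
# $1 = SyMAP directory (e.g. symap_5), $2 = architecture name you choose
add_arch() {
    symap_dir=$1
    arch=$2
    # Step 1: one directory per external program, named after the machine.
    mkdir -p "$symap_dir/ext/mummer/$arch" "$symap_dir/ext/mafft/$arch"
    # Step 2 is manual: copy your executables in, mirroring lintel64.
    # Step 3: add the arch line unless the config already has one.
    if ! grep -q '^[[:space:]]*arch[[:space:]]*=' "$symap_dir/symap.config" 2>/dev/null; then
        echo "arch=$arch" >> "$symap_dir/symap.config"
    fi
}
```

For example, `add_arch symap_5 myarch` would create ext/mummer/myarch and ext/mafft/myarch and append `arch=myarch` to symap_5/symap.config.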

For MUMmer, see Executables and Using MUMmer4. On MacOS, SyMAP may fail when running MUMmer; if it does, see MacOS externals.

MySQL parameters for SyMAP

Parameters for accessing the MySQL database should be set in the symap.config file in the main symap directory, as follows:

Database Parameters
db_name           Name of the MySQL database, which SyMAP will create when it first reads symap.config. It is standard to start the name with symap, e.g. symapDemo.
db_server         The machine hosting the MySQL database, e.g. myserver.myschool.edu. If using your local machine, enter localhost.
db_adminuser      MySQL username of a user with sufficient privileges to create a database. It is also necessary for loading, deleting and running synteny.
db_adminpasswd    Password of the admin user.
db_clientuser     Optional: MySQL username of a user with read-only access. This is only necessary if you want a machine to run viewSymap as read-only.
db_clientpasswd   Optional: Password of the client user (if db_clientuser is non-blank).

Example symap.config.

  db_name             = symapDemo
  db_server           = localhost
  db_adminuser        = <adminid>
  db_adminpasswd      = <password>
  db_clientuser       =
  db_clientpasswd     =
To use a configuration file other than symap.config, use the "-c" command line argument, e.g.
  >./symap -c symapTmp.config
This is useful if you have multiple SyMAP databases.
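
When juggling several configuration files, a quick check before starting SyMAP can catch a blank entry. The function below is a hypothetical pre-flight helper, not part of SyMAP; it assumes the "key = value" layout shown in the example and treats the four non-optional parameters as required.

```shell
# Report required parameters that are missing or blank in a config file.
# Returns non-zero if any are missing.
check_symap_config() {
    cfg=$1
    status=0
    for key in db_name db_server db_adminuser db_adminpasswd; do
        # Take the text after "=" on the key's line and strip whitespace.
        value=$(sed -n "s/^[[:space:]]*$key[[:space:]]*=[[:space:]]*//p" "$cfg" | tr -d '[:space:]')
        if [ -z "$value" ]; then
            echo "missing: $key"
            status=1
        fi
    done
    return $status
}
```

Usage might look like `check_symap_config symapTmp.config && ./symap -c symapTmp.config`.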

4. Demo


Running the Demo

If you have not used SyMAP before, it is essential to run the demos. After you have installed MySQL, do the following:
  1. Change into the symap_5 directory.

  2. Edit symap.config and enter database and host information (see MySQL).

  3. From the command line, type ./symap.

The first time you run SyMAP, it will create the database with information written to the terminal, e.g.

Creating database 'symapDemo' (jdbc:mysql://localhost/symapDemo?characterEncoding=utf8).

It will check your MySQL variables; if there are any "Suggested" changes, see Trouble Shoot MySQL.

It will also check that the provided external programs (e.g. MUMmer) are executable; if it shows any problems, see Executables. For MacOS, you may also need MacOS externals.

Demo two genome synteny

Executing ./symap will bring up the Manager panel as shown on the lower right; it will show the three demo projects provided with the SyMAP tarball.
 
Check Demo-Seq and Demo-Seq2 and they will be displayed on the right panel.

A link Load All Projects will be displayed at the top of the right panel; select it to load the projects, which will take several minutes. If loading the Demo-Seq takes more than a few minutes, you may need to adjust the MySQL parameters, see TroubleShoot MySQL.

[Figure: Project Manager (demoLoad)]

When done, the Manager will look like the image on the right. You may verify the results by selecting the View link.

In the Available Syntenies table, the cell for Demo_Seq2 and Demo_Seq will automatically be selected (green cell).

Click the Selected Pair button to start the Align&Synteny.

[Figure: demoMgr]

The Align&Synteny takes less than 5 minutes on the MacOS x86_64 but could take up to 30 minutes on a slow machine.

When done, the table will have a checkmark (✓), signifying that the synteny is available for viewing.

[Figure: demoTable]

From the Report... pull-down, select Summary to view the summary shown on the right; there may be slight differences in the #Cluster hits because of different numbers of CPUs, MacOS vs Linux, etc. (but the #Blocks come out the same). The resulting Dot plot is shown in the Demo Results.

 

Once the alignments are computed, you can experiment with the Align&Synteny parameters without having to redo the alignments. This is done by changing the options on the pair Parameters panel.

See Demo Results for the results from using the Cluster Algo1 (modified original) and Synteny Original.

[Figure: demoSum (Algo2, Strict)]

Demo draft ordering

From the Manager left side, select Demo-Draft and Demo-Seq2. Load Demo-Draft.

Open the Parameters panel; at the bottom, select Draf->Seq2 and uncheck Strict. It is recommended to use Cluster Algo1.

[Figure: order parameters]

Run the Align&Synteny, where the alignment should take less than 10 minutes with one CPU.

When done, open the Summary for the pair, which will be similar to what is shown on the right; as mentioned above, there may be slight differences in the number of Cluster hits.

See the first dot plot in Demo-draft.

 

 

A new project called Demo_Draft.Demo_seq2 has been created. It is shown on the Manager left side (see image on lower right).

Load Demo_Draft.Demo_seq2.

Run the Align&Synteny with the Demo_seq2 and Demo_Draft.Demo_seq2 projects, which will produce the synteny with ordered draft sequences.

See the second dot plot for Demo-Draft.

[Figures: demo Sum Draft; demoDraft3]

Using the project's Parameters panel, the Demo_Draft.Demo_Seq2 display name can be shortened.

Demo self-synteny

To perform self-synteny, select the cell for the Demo_seq row and column (it turns green). The default for synteny computation is Strict, which results in zero blocks for this project. Hence, open the Parameters panel, unselect Strict, then Save. Then click Selected Pair from the Manager.
This computes a few tiny 'iffy' blocks, as described in Demo self-synteny. See A.thal self for a better example.

5. New projects and synteny


Database name: In the symap.config file that comes with the tar file, the parameter db_name (database name) is set to symapDemo. Edit the symap.config file to set the database name to something more meaningful. It is conventional to start the database name with 'symap'.

The following provides an outline of building a synteny database, referring to the webpages that contain the specific information.

Load project

Input files: See Input for an explanation on the input files and how to define their location to symap.

Load project: Assume that the projects foo and bar were created as described in Create.

  • See project parameters on setting the parameters; it is important to get these correct before running A&S.

  • Then select Load project. See Load for details.
[Figure: load]

Align&Synteny

[Figure: align and synteny]

Suggestion for initial fast results:

  • Select the Mask option for both sequences in pair parameters before aligning.
  • This should perform a fast alignment.
  • Then, if the initial results look good and gene-discovery is desired, Clear Pair and redo without the masking.

Results: The result files are in the following directory:

   data/seq_results/<project1>_to_<project2>/align
As mentioned in Disk, after the database is complete, these can be removed. However, sometimes SyMAP version updates require the project files to be reloaded and/or the synteny to be recomputed; if these files remain, the existing MUMmer files will be used, which saves a lot of time.

The log files are in the /logs directory, see MUMmer log files for more details.

See Using MUMmer with SyMAP for a discussion on how it works in SyMAP, trouble-shooting, and running MUMmer externally (i.e. if your local machine does not have enough memory, you may need to run it on a bigger machine).

Draft alignment and ordering

Draft contigs are unordered and unoriented contiguous DNA sequences. They can be ordered and oriented against a closely related complete sequenced genome using the following approach.

  1. It is a good idea to first try Demo draft ordering.

  2. Load project for the draft contigs and related complete genome sequence.

  3. Open the pair's Parameters panel.
    Order against: At the bottom of the panel, select the radio button that indicates ordering your draft against a complete sequence (e.g. Draf->Seq2).

    Use Cluster Algo1 and do not use any Synteny Blocks options.

    [Figure: order parameters]

  4. Run Align&Synteny on the pair.
    This orders the draft contigs and creates a new project.

    The new project is named from the two project-names separated by "..". The new project will appear on the left panel of the Manager.

    [Figure: demoDraft3]

  5. Load project for the new project.
    Select the new project and the complete sequenced genome it was ordered against.

  6. Run Align&Synteny.

The ordering algorithm creates the following files and directories:

1. File of ordered draft sequences: It writes the order of the contigs along with whether they should be flipped to a file called /data/seq/<draft-project>/<complete-project>_ordered.csv.

2. New project: It creates a new project directory called data/seq/<draft-project>..<complete-project> (the naming allows the draft to unambiguously be ordered against different genomes). This directory will contain a sub-directory /sequence containing the FASTA file and /annotation containing the gap file.

3. FASTA sequence file of ordered contigs:

  • Any contigs matching the Order against genome will be assigned the same chromosome name.
  • All contigs aligning to an Order against chromosome will be appended together in order with 100 N's between each scaffold. The contigs will be flipped as appropriate.
  • Any extra contigs will be put in ">Chr0".

4. Gap file: An annotation directory with a ".gff" file that specifies where the gaps are.
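
To make item 3 concrete, the following sketch joins contigs for one chromosome with 100 N's between them, flipping (reverse-complementing) those marked with "-". It is only an illustration of the described output, not SyMAP's own code; the function names and the "<sequence> <+|->" input format are assumptions made for this example.

```shell
# Reverse-complement one DNA sequence read from stdin (single line).
revcomp() {
    awk '{ out = ""
           for (i = length($0); i > 0; i--) out = out substr($0, i, 1)
           print out }' | tr 'ACGTacgt' 'TGCAtgca'
}

# Join contigs that are already in order. Input lines on stdin:
# "<sequence> <+|->", where "-" marks a contig to flip.
join_contigs() {
    gap=$(printf '%100s' '' | tr ' ' 'N')    # 100 N's between contigs
    first=1
    while read -r seq strand; do
        if [ "$strand" = "-" ]; then
            seq=$(echo "$seq" | revcomp)
        fi
        [ "$first" -eq 1 ] || printf '%s' "$gap"
        printf '%s' "$seq"
        first=0
    done
    echo
}
```

For instance, `printf 'ACGT +\nAACC -\n' | join_contigs` yields ACGT, then 100 N's, then GGTT.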

If the draft sequence is split into too many contigs, the MUMmer comparisons take a long time. Also, the displays are cluttered to the point of being unreadable, but the contigs will generally be merged in the new project synteny, so the new display is fine. Nevertheless, you may want to remove the smallest contigs. This can be done by setting Minimum length in the project's Parameter panel so that only the largest sequences (e.g. 150) are loaded. To determine the minimum length, use the Lengths button in the xToSymap interface, which prints all the lengths; set Minimum length to the 150th largest length.
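
As an alternative to reading the value off the Lengths output by eye, the Nth largest length can be computed directly from a FASTA file. This is a sketch using standard tools only; nth_longest_length is a name invented for this example and is not an xToSymap feature.

```shell
# Print the length of the Nth-longest sequence in a FASTA file.
#   $1 = FASTA file, $2 = N (e.g. 150)
nth_longest_length() {
    awk '/^>/ { if (seen) print len; len = 0; seen = 1; next }
         { len += length($0) }
         END { if (seen) print len }' "$1" |
        sort -nr | sed -n "${2}p"
}
```

Running `nth_longest_length draft.fa 150` (with draft.fa standing in for your own file) prints the value to enter as Minimum length; it prints nothing if the file has fewer than 150 sequences.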

Self-synteny

To perform self-synteny, select the cell for the same project (it turns green) followed by Selected Pair.
By default, SyMAP uses the MUMmer 'NUCmer' program for self-alignments. Each chromosome is compared to every other chromosome, including itself.
  • Chromosome to itself: The Align&Synteny Parameters panel has an option to set Self Args, which is only used when comparing the chromosome sequence file to itself.
  • Make sure that the Cluster Hits Algorithm 1 option is selected.
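
A quick way to anticipate the workload: if every unordered pair of chromosomes is aligned once, including each chromosome against itself, then n chromosomes give n(n+1)/2 alignments. The helper below (a name made up for this example) just evaluates that formula; it assumes no other grouping of the sequences takes place.

```shell
# Number of chr x chr alignments for self-synteny with n chromosomes,
# counting each unordered pair once plus each chromosome against itself.
self_alignments() {
    echo $(( $1 * ($1 + 1) / 2 ))
}

self_alignments 10    # prints 55
```

With 10 chromosomes this gives 55, which agrees with the 55 alignments reported for Maize x self in the timings.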

A better demonstration than the demo is to download Arabidopsis thaliana from NCBI, convert it with the NCBI convert script, and run the self-synteny. It took 16 minutes with one processor on a Mac Mini (2018) with 64Gb main memory. The dot plot is shown on the right (click on the image for a closeup view).

The Dot plot is symmetric, with the same block on both sides of the diagonal. For self-synteny query and display, see Self-synteny.

[Figure: arab self]

Cancel

The Load and A&S methods have a popup progress panel, as shown on the right. There is a Cancel button on the bottom that can be clicked to cancel the execution; it will remove the results from the database and exit.

Occasionally, Cancel will cause an error to be written to the error.log file or to the terminal. This is not a problem, though you may need to remove the results yourself.

If MUMmer is running when you Cancel, make sure an "Error: Failed command:" line is written to the terminal for each MUMmer alignment that was running; if not, use the "top" Linux command to view the running processes and stop any MUMmer processes still running.
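
As an alternative to scanning "top" by eye, leftover processes can be listed with ps. This is a sketch under one assumption: that the leftover processes show up under the MUMmer program names mummer, nucmer or promer (possibly with a leading path); on some systems they may instead appear under an interpreter name. The function names are invented for this example.

```shell
# Keep only "pid command" lines whose command is a MUMmer program.
filter_mummer() {
    awk '{ n = split($2, part, "/")
           if (part[n] ~ /^(mummer|nucmer|promer)$/) print $1, $2 }'
}

# List candidate leftover processes on the current machine.
list_mummer_procs() {
    ps -A -o pid=,comm= | filter_mummer
}
```

Each line printed by `list_mummer_procs` starts with a process id, which can then be stopped with `kill <pid>` once you have confirmed it belongs to the cancelled run.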

Also, see Trouble Shoot Hang.

6. General


How to update SyMAP with a new release

If you have been working with SyMAP and have existing projects:
  • If the symap.jar is available from the download site and there are only changes to it, download it and replace the one in symap_5/java/jar.
or if there are changes to more than the jar file:
  • Put the new symap_5.tar.gz in a permanent location and untar it.
  • Copy the data/ directory and symap.config from your previous SyMAP location into this new location, replacing the ones that came with the new release.
  • This approach is safest as it acquires all changes (e.g. scripts) except for changes to the demo files.
or
  • Put the new symap_5.tar.gz in a temporary location and untar it.
  • Move symap_5/java/jar/symap.jar to the java/jar location of your permanent SyMAP.
  • Check to see if there are any /scripts or /ext changes that need to also be copied over.
The Align&Synteny will use existing MUMmer files if they have not been removed.
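
The second approach above (a fresh untar that takes over your existing data and configuration) might be scripted as in the sketch below. carry_over is a hypothetical helper; it assumes both directories have the standard layout and it replaces the data/ directory shipped with the new release, so try it on a copy first.

```shell
# $1 = previous SyMAP directory, $2 = freshly untarred symap_5 directory
carry_over() {
    old=$1
    new=$2
    # Refuse to touch anything unless the previous install looks complete.
    [ -d "$old/data" ] && [ -f "$old/symap.config" ] || return 1
    rm -rf "$new/data"
    cp -R "$old/data" "$new/data"
    cp "$old/symap.config" "$new/symap.config"
}
```

Because data/seq_results is copied along with the rest of data/, any existing MUMmer files remain available to the new release.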

How SyMAP Works

This section provides a brief overview of the SyMAP processing steps; for more, see the SyMAP published papers [4,5]. The processing has the following phases:
Alignment:
The sequences are written to disk*, with gene-masking if desired. In the alignment, one species is the "query" and the other is the "target". The query is the one whose name comes first alphabetically (e.g. A<B). The query sequences are written into one large file, while smaller target sequences are grouped into larger FASTA files of size up to 60Mb, for more efficient processing in MUMmer. There is an option, Concat, that when unchecked causes the query sequences to be treated the same as the target; i.e. there will generally be more sequence files to process, but they will be smaller. See Concat for a description and timing results.
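
The idea of grouping target sequences into files of bounded size can be illustrated with a simple greedy pass. This is not SyMAP's actual grouping code, only a sketch of the concept: group_sequences is an invented name, the input format is assumed, and "60Mb" is taken here to mean 60,000,000 bases.

```shell
# Input lines on stdin: "<name> <length in bases>".
# Output: "<group number> <name>", filling each group up to the limit.
#   $1 = maximum bases per group (default 60000000)
group_sequences() {
    awk -v max="${1:-60000000}" '
        {
            # Start a new group when this sequence would overflow the
            # current one, unless the current group is still empty.
            if (used > 0 && used + $2 > max) { group++; used = 0 }
            print group + 1, $1
            used += $2
        }'
}
```

A sequence longer than the limit simply gets a group of its own, which mirrors the fact that large chromosomes are aligned as single files.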

Anchor Clustering and Filtering:
The raw anchor set consists of the hits found by MUMmer, which are filtered and clustered for input to the synteny algorithm.

Algorithm 1 (modified original) is good for medium-to-high divergent genomes, aligning draft sequence, self-synteny, and genomes with little or no annotation. The MUMmer hits are first clustered into gene, or putative-gene, hits. This is done by clustering the hit regions on each sequence, and then defining new "gene" hits which connect these regions. For example, if three separate exons hit between two genes, they will be clustered into one "gene" hit having a combined score equal to the sum of the raw hit scores. Clustering is by gene if the hits overlap annotation; otherwise, it creates "candidate genes" from hits that do not overlap annotation.

The clustered "gene anchors" are then filtered using a version of reciprocal-best filtering which is adapted for retaining duplications and gene families. For each pair of genes (or putative genes) which is connected by a clustered anchor, the retained anchors must be among the top two anchors by score on both sides (top-2 allows for one ancestral whole-genome duplication). An anchor will also be retained if its score is at least 80% of that of the 2nd-best anchor on each side (this allows for retention of gene family anchors). These filter parameters may be adjusted through the Align&Synteny Parameters panel.
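
The filter described above can be sketched in a few lines of awk. This is a simplified reading of the rule, not SyMAP's implementation: since any top-2 anchor already scores at least 80% of the 2nd-best score, the two conditions collapse into one test per side. The function name, the three-column input format and the treatment of genes with a single anchor (always kept) are assumptions of this example.

```shell
# Input lines on stdin: "<geneA> <geneB> <score>", one anchor per line.
# Output: anchors whose score is at least frac * (2nd-best score) for
# both of their genes, in input order.
#   $1 = fraction of the 2nd-best score (default 0.8)
filter_anchors() {
    awk -v frac="${1:-0.8}" '
        # Track the best and second-best score seen for a key.
        function note(key, s) {
            if (s > best[key] + 0)        { second[key] = best[key] + 0; best[key] = s }
            else if (s > second[key] + 0) { second[key] = s }
        }
        {
            a[NR] = $1; b[NR] = $2; sc[NR] = $3 + 0
            note("A:" $1, sc[NR])
            note("B:" $2, sc[NR])
        }
        END {
            for (i = 1; i <= NR; i++)
                # A gene with one anchor has second-best 0, so it passes.
                if (sc[i] >= frac * second["A:" a[i]] &&
                    sc[i] >= frac * second["B:" b[i]])
                    print a[i], b[i], sc[i]
        }'
}
```

For a gene g1 with anchors scoring 100, 90, 75 and 50, the 2nd-best score is 90, so the cutoff is 72: the first three anchors are kept and the last is dropped.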

Algorithm 2 (exon-intron) is good for low-to-high divergent genomes with good annotation. It directly maps hits to the exons and introns. Hits aligning to un-annotated regions are clustered separately. There are many more parameters for this approach, as the hits are filtered based on the parameter values.

Synteny Block Detection:
After the clustered anchors are loaded into the database, the synteny block algorithm runs. This algorithm looks for approximately-collinear sequences of anchors, subject to several parameters including (A) Number of anchors; (B) Collinearity of the anchors; (C) Amount of "noise" in the surrounding region (to help reject false-positive chains). Criterion A can be adjusted in the Align&Synteny Parameters panel.

* Note that the sequences are re-written from the database to the disk for three reasons: (A) To allow re-grouping for efficiency; (B) To ensure elimination of invalid characters; (C) To mask non-gene regions, if desired. This also ensures that sequence names will match those in the database, and prevents problems caused by moving the source sequences on disk.

FPC project

For working with FPC [8,9], it is suggested that you use release v5.0.8 from SyMAP releases.
  1. It has the FPC demo files.
  2. It has BLAT [10] in the /ext directory.
  3. It has the tar file doc.tar.gz of the documentation.
  4. The AGCoL documentation applies to this release.
If you run into any problems, please do not hesitate to contact cas1@arizona.edu.

References

[1] Kurtz, S., Phillippy, A., Delcher, A.L., Smoot, M., Shumway, M., Antonescu, C., Salzberg, S.L. (2004). Versatile and open software for comparing large genomes. Genome Biology 5:R12.

[2] Marcais, G., A.L. Delcher, A.M. Phillippy, R. Coston, S.L. Salzberg, A. Zimin (2018). MUMmer4: A fast and versatile genome alignment system. PLoS Computational Biology 14(1):e1005944.

[3] Krzywinski, M., J. Schein, I. Birol, J. Connors, R. Gascoyne, D. Horsman, S. Jones, M. Marra (2009). Circos: An information aesthetic for comparative genomics. Genome Research doi:10.1101/gr.092759.109.

[4] Soderlund, C., Nelson, W., Shoemaker, A., and Paterson, A. (2006). SyMAP: A system for discovering and viewing syntenic regions of FPC maps. Genome Research 16:1159-1168.

[5] Soderlund, C., Bomhoff, M., and Nelson, W. (2011). SyMAP: A turnkey synteny system with application to multiple large duplicated plant sequenced genomes. Nucleic Acids Research 39(10):e68.

[6] Katoh, Standley (2013). MAFFT multiple sequence alignment software version 7: improvements in performance and usability. Molecular Biology and Evolution 30:772-780.

[7] Edgar, R. (2004). MUSCLE: a multiple sequence alignment method with reduced time and space complexity. BMC Bioinformatics 5:113.

[8] Soderlund, C., S. Humphray, A. Dunham, and L. French (2000). Contigs built with fingerprints, markers and FPC V4.7. Genome Research 10:1772-1787.

[9] Engler, F., J. Hatfield, W. Nelson, and C. Soderlund (2003). Locating sequence on FPC maps and selecting a minimal tiling path. Genome Research 13:2152-2163.

[10] Kent, J. (2002). BLAT--the BLAST-like alignment tool. Genome Research 12:656-64.

Email: cas1@arizona.edu