Description, Instructions, and Tips for MS-Pattern

Purpose

This document provides instructions for MS-Pattern.


Introduction and Background

MS-Pattern is a database pattern matching program that has been largely superseded by MS-Homology. It has been retained because it has a few complementary features:

  1. It uses a completely different algorithm based on text regular expression matching. This provides a few features that are not duplicated in MS-Homology. Details can be found in the Regular Expressions section.
  2. Because it uses a different algorithm, there may be situations in which it is faster than MS-Homology.
  3. The form is somewhat simpler than the one used by MS-Homology.
  4. There is a Pre Search Only feature.

Regular Expressions

c

An ordinary character (i.e., not one of the special characters discussed below) is a one-character regular expression that matches that character.

\c

A backslash followed by any special character is a one-character regular expression that matches the special character itself.

The following are always special characters in a regular expression, except when they appear in square brackets ([]):

. dot
* asterisk
[ left square bracket
\ backslash

^ caret

The caret is a special character when it is at the beginning of an entire regular expression, or when it immediately follows the left bracket of a pair of square brackets.

$ dollar

The dollar sign is a special character when it is at the end of an entire regular expression.
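
The rules above for ordinary characters, backslash escapes, and the ^ and $ anchors can be tried out with a short sketch. It uses Python's re module rather than grep, so it is only an approximation: the two syntaxes differ in places, but the constructs shown here behave the same way in both. The sequence and the helper name matches are made up for illustration.

```python
import re

# Hypothetical protein sequence, used only for illustration.
sequence = "MKWVTFISLLFLFSSAYSRGVFRR"


def matches(pattern, text):
    """Return True if the regular expression matches anywhere in text."""
    return re.search(pattern, text) is not None


# An ordinary character matches itself.
print(matches("W", sequence))      # True

# ^ at the start anchors the match to the beginning of the text.
print(matches("^MKW", sequence))   # True
print(matches("^KWV", sequence))   # False: KWV occurs, but not at the start

# $ at the end anchors the match to the end of the text.
print(matches("RR$", sequence))    # True

# A backslash makes a special character literal: \. matches only a dot.
print(matches(r"\.", "A.B"))       # True
print(matches(r"\.", "AB"))        # False
```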

Square brackets have special meaning in a regular expression. The regular expressions used are of the form used by the UNIX grep facility. Examples (type man grep on a UNIX system for full details):

[EF] The amino acid is either E or F.
[A-IK-Y] The amino acid is alphabetically either between A and I or between K and Y (that is, any capital letter except J and Z).
[^EF] The amino acid is anything but E or F.
. Any single amino acid is possible.
.\{1,4\} Between 1 and 4 occurrences of any amino acid.
.* Used to represent a sequence of zero or more unknown amino acids. Note that this is "dot-star", not just "star". This wildcard allows some not entirely obvious features. A match is to the longest sequence fitting the condition (e.g., FMQ.*K will find the last K in the sequence following FMQ).
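
The examples above can be checked with a small sketch, again using Python's re module as a stand-in for grep. One difference matters here: the grep interval .\{1,4\} is written .{1,4} in Python. The sequences and the helper name find are hypothetical.

```python
import re


def find(pattern, sequence):
    """Return the first matched substring, or None if there is no match."""
    m = re.search(pattern, sequence)
    return m.group(0) if m else None


# Hypothetical sequence containing FMQ followed by two K residues.
seq = "AAFMQGGKTTKLL"

# Bracket expressions: one of a set, a range, or a negated set.
print(find("Q[EF]", "AQEA"))       # QE
print(find("Q[^EF]", "AQEAQG"))    # QG
print(find("[A-IK-Y]", "JZK"))     # K (J and Z are outside both ranges)

# Dot-star is greedy: the match runs to the LAST K after FMQ.
print(find("FMQ.*K", seq))         # FMQGGKTTK

# Dot-star can also match an empty stretch.
print(find("FMQ.*K", "FMQK"))      # FMQK

# A bounded interval: at most 4 residues between FMQ and K.
print(find("FMQ.{1,4}K", seq))     # FMQGGK
```

Bounding the gap with an interval instead of dot-star is one way to avoid unexpectedly long matches when the sequence contains several candidate end residues.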

Setting the Max. # of Mismatched AA's parameter to a value other than 0 allows homologous sequences to be matched. The search then permits up to that number of positions in the pattern to differ from the protein sequence in the database. In future revisions of MS-Pattern this parameter may be replaced by PAM matrices of the kind used in sequence homology programs like BLAST.
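
The idea of tolerating a fixed number of mismatched positions can be illustrated with a minimal sketch. This is not MS-Pattern's actual algorithm, which is not described here. It handles only a plain sequence of amino acid letters with no regular expression operators, and simply counts differing positions in each window of the database sequence. The function name and the sequences are hypothetical.

```python
def find_with_mismatches(pattern, sequence, max_mismatches):
    """Return (start, window, mismatches) for every window of the sequence
    that differs from the pattern in at most max_mismatches positions.

    Assumption: pattern is a plain string of residues, not a regular
    expression, so every window has the same length as the pattern.
    """
    hits = []
    n = len(pattern)
    for start in range(len(sequence) - n + 1):
        window = sequence[start:start + n]
        # Count positions where the pattern and the window disagree.
        mismatches = sum(1 for p, s in zip(pattern, window) if p != s)
        if mismatches <= max_mismatches:
            hits.append((start, window, mismatches))
    return hits


# Hypothetical sequence with one exact and one near match to FMQK.
seq = "AAFMQKAAFLQKAA"

print(find_with_mismatches("FMQK", seq, 0))  # [(2, 'FMQK', 0)]
print(find_with_mismatches("FMQK", seq, 1))  # [(2, 'FMQK', 0), (8, 'FLQK', 1)]
```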


Pre Search Only mode runs a search using only the pre-search parameters. Some examples of its use would be to:

  1. List all the proteins in a specific molecular weight or pI range for a particular species.
  2. List all the proteins in the database where the name field contains the word keratin.
  3. List all the proteins for a specific species code.

If the number of matches exceeds Max. Reported Hits, the report gives the total number of hits but lists only the first Max. Reported Hits of them.

The Pre Search hits can be saved to a file and used as input for future searches.
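
The kind of filtering a pre-search performs, together with the cap on listed hits, might be sketched as follows. Everything here is an assumption made for illustration: the Entry fields, the pre_search function and its parameters, the case-insensitive name test, and the sample records are hypothetical and do not reflect MS-Pattern's database format or implementation.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    """Hypothetical database record; field names are assumptions."""
    name: str
    species: str
    mw: float   # molecular weight in Da
    pi: float   # isoelectric point


def pre_search(entries, species=None, mw_range=None, pi_range=None,
               name_word=None, max_reported_hits=10):
    """Return (total_hits, listed_hits) for entries passing every filter.

    A filter left as None is not applied. The name test is
    case-insensitive, which is an assumption of this sketch.
    """
    hits = []
    for e in entries:
        if species is not None and e.species != species:
            continue
        if mw_range is not None and not (mw_range[0] <= e.mw <= mw_range[1]):
            continue
        if pi_range is not None and not (pi_range[0] <= e.pi <= pi_range[1]):
            continue
        if name_word is not None and name_word.lower() not in e.name.lower():
            continue
        hits.append(e)
    # Report the total, but list only the first max_reported_hits entries.
    return len(hits), hits[:max_reported_hits]


# Made-up records, not real database values.
entries = [
    Entry("Keratin, type I", "HUMAN", 48000.0, 5.0),
    Entry("Keratin, type II", "HUMAN", 66000.0, 8.1),
    Entry("Serum albumin", "BOVIN", 69000.0, 5.8),
    Entry("Keratin, type I", "MOUSE", 47500.0, 5.1),
]

total, listed = pre_search(entries, name_word="keratin", max_reported_hits=2)
print(total, [e.species for e in listed])  # 3 ['HUMAN', 'HUMAN']
```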