Batch Searching and Analysis using Protein Prospector

This document provides a summary of features in Protein Prospector specifically designed for the analysis of large numbers of MS/MS spectra submitted as a batch. The analysis of this type of data is split between two programs in Protein Prospector. First, 'Batch-Tag' searches the data against a database. The results of this search are then summarized in a second program, 'Search Compare'. A description of the workings and performance of these two programs was first published in 2005 (Chalkley, R.J. et al. Mol Cell Proteomics (2005) 4 8 1194-1204). An updated description, which covers the expectation value calculation, was published in 2008 (Chalkley, R.J. et al. Mol Cell Proteomics (2008) 7 12 2386-2398).

On the Protein Prospector home page there are links to two different Batch-Tag forms. Batch-Tag Web is used when the data needs to be uploaded to the server. If the data is already on the server then you should use the Batch-Tag form. If you are using the public web server you should initially use the Batch-Tag Web form. If you want to search the data again with different parameters you should use the Batch-Tag form.


The results of all Batch-Tag searches are saved by Protein Prospector and can be viewed at any time using Search Compare. To prevent all users from having access to everyone else's results, the user must enter a username and password when using Batch-Tag and Search Compare. This allows the user to browse through all of their previous results, and also allows the comparison of search results.


When selecting Batch-Tag, Batch-Tag Web or Search Compare, a Login page will be brought up, where the user must enter their username and password. If you do not have a username, you will need to create one by clicking on "Add User". This will bring up a separate page where you must enter a username, password and e-mail address. User names may contain only lower case letters and numbers.


There are a number of protein and gene databases available for download from the web. However, they differ widely in terms of number of entries and redundancy. Of the databases available on the website, SwissProt is the smallest but best annotated, Uniprot (the combination of SwissProt and trEMBL) is significantly bigger but less annotated and NCBInr is the biggest but most poorly annotated. Concatenated databases are available that can be used for estimating false discovery rates in results.

Species limited searches in Protein Prospector programs are performed by means of pre-filtering database entries according to the user-designated species prior to searching. Combinations of species can also be searched (e.g. mammals). It is possible to select more than one option from the taxonomy list, providing one of them is not 'All'. In addition to the list of options in the taxonomy list, any taxonomy identifier from the NCBI taxonomy browser (e.g. Green Plants) can be entered in the 'Taxonomy Names' field in the 'Pre-search Parameters'.

This is the name given to the search results when viewed in Search Compare.

These are a set of parameters that allow you to filter the entries in a database that are searched. One can restrict the search by protein MW or pI. One can input a list of species codes to be searched, restrict to proteins with a certain word in their name, specify a list of accession numbers to search or add additional accession numbers from proteins that would not be considered based on other filtering parameters (e.g. you may want to add pig trypsin as a possible match to a search of bacterial proteins).

Some peaklist generation software does not assign a charge state to precursor ions. If no charge is listed, the data will be searched using the entire list of charge states selected in the Precursor Charge Range field.

The mass tolerance for the precursor ion can be expressed in parts per million (ppm), Daltons, mmu (millimass units) or % of mass.

Sometimes the calibration of data can contain a systematic mass error. For example, all precursor masses may have errors between +60 to +100ppm. In this situation, searching the data with an 80ppm systematic error and a MS/MS Parent Tolerance of 20ppm would give more reliable results without losing any correct answers. Peptide Results Reports display a histogram of mass errors if there are over 50 peptides matched and will report a mean systematic error. The units of systematic error will be the same as for the Parent Tol.
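The arithmetic behind the systematic error setting can be sketched as follows. This is a minimal illustration, not Protein Prospector's own code: the function names and the example masses are hypothetical, and it assumes the systematic error is simply subtracted from the measured ppm error before the tolerance window is applied.

```python
def ppm_error(observed_mass, theoretical_mass):
    """Return the precursor mass error in parts per million."""
    return (observed_mass - theoretical_mass) / theoretical_mass * 1e6


def within_tolerance(observed_mass, theoretical_mass,
                     systematic_error_ppm=0.0, tolerance_ppm=20.0):
    """True if the error, after removing the systematic offset, fits the window.

    Assumption: the systematic error is subtracted from the measured error
    and the remainder is compared with the parent tolerance.
    """
    corrected = ppm_error(observed_mass, theoretical_mass) - systematic_error_ppm
    return abs(corrected) <= tolerance_ppm


# A precursor measured about 90 ppm too high (hypothetical masses).
observed, theoretical = 1000.09, 1000.0
print(round(ppm_error(observed, theoretical), 3))
print(within_tolerance(observed, theoretical, systematic_error_ppm=80.0))
print(within_tolerance(observed, theoretical, systematic_error_ppm=0.0))
```

With an 80 ppm systematic error and a 20 ppm tolerance the match is kept, whereas a 20 ppm tolerance centred on zero would reject it.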

The mass tolerance for the fragment ions can be expressed in parts per million (ppm), Daltons, mmu (millimass units) or % of mass.

This parameter defines the ion types that are searched for when matching the MS/MS fragment masses (see table). The ion types searched for are the same whether it is a MALDI or ESI instrument, but different weightings for the ion types are used in the scoring. For those options defined as ‘low res’ the search engine assumes an inability to determine fragment ion charge state, and will consider an ion as potentially being singly- or doubly-charged.

Q-TOF TOF_TOF ION-TRAP FT-ICR-ECD FT-ICR-CID ETD
a
b and y
c and z
c-1 and z+1
a loss
b loss
y loss
Internal
Immonium
Internal loss
d, v, w

This defines the cleavage specificity assumed when searching the database for MS/MS parent masses. Some combinations of enzymes are available.

It is possible to search for non-specific cleavage at either one of the peptide termini or one can relax the specificity at a specific terminus. Note this will dramatically increase search times, so it is recommended to only use this on a small database or for searching against a list of accession numbers.

This defines the maximum number of missed enzyme cleavage sites present in a peptide for it to be considered in the database search.

Modifications selected here will be assumed to be always present.

If you tick the Save Settings box then the current form settings will be saved to a cookie when you press the Start Search button. If you do this the search will not be performed so you will have to return to the Batch-Tag/Batch-Tag Web form and uncheck Save Settings to do the search. You thus shouldn't try to upload a file when saving the settings.

There is a browser specific maximum cookie length which is generally 4096 characters. This should be sufficient for most combinations of parameters. However, if you include long lists of accession numbers, species names or variable modifications it is possible to exceed this limit. In such cases an error message will be generated and the cookie will not be saved.

You need to be careful when saving parameters which are generally hidden as these will be applied to all future searches.

The cookie which is saved is called batchtag_params and you can delete it using your browser's cookie management facility. If you use multiple different Protein Prospector sites each one will have a different cookie.

This option allows you to turn off the expectation value calculation. For searches of very small databases (especially searches by accession number) the expectation values reported become inaccurate. The expectation value calculation involves searching a randomized database before the standard database search, so turning it off can also speed up searches.

The database used for the expectation value search is defined in the expectation.xml file. Unless this has been changed from the default the SwissProt database is used. The entries are randomized on the fly and the spectra are searched against the database until enough random peptides have been generated that match the precursor mass of the spectrum. There is a facility for looping through the database multiple times if enough random peptides are not generated and a facility for aborting this process at some point if this is not possible within a reasonable time.

If you use the option Linear Tail Fit the seed for the randomization is based on the time. This means that the randomization is different every time this is done and that if you upload the data multiple times in different projects the expectation values, and thus the results, will be slightly different. This reflects the statistical nature of the results and the uncertainty in the process.

If you want to avoid this problem it is possible to use the option Linear Tail Fit With Consistent Seeds which will use the same seeds when randomizing the database. This will work as long as the SwissProt database is not updated.

There is a great variety in the quality of peak picking by software when producing peaklists to be submitted for database searching, with some peaklists being essentially an unprocessed list of hundreds of masses, most of which are noise. Protein Prospector takes the mass range of the peaks submitted, splits it in half, then for most instruments takes only the top 20 peaks in each half of the spectrum to produce a list of 40 peak masses that are used for database searching. In most cases 40 peaks has been found to be the optimal number for tryptic peptides to give specific answers without introducing too many false matches to noise peaks. However, in some cases there may be a benefit to increasing or decreasing this number. For example, if peptides are generally long (maybe an enzyme was used that cuts infrequently) then there will be more real fragments in the spectrum, so increasing this value may be beneficial. Also, if you are analyzing post-translationally modified peptides and want to identify sites of modification, then allowing more peaks that may assist in site assignment may be important. A second option is to specify the Max MSMS Peaks/100 Da. Typical suitable values would be in the range of 4 to 10 peaks/100 Da.
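The default peak selection described above can be sketched as follows. This is an illustrative reading of the description rather than the program's actual code: the function name is hypothetical, "top" is taken to mean most intense, and the handling of a peak that falls exactly on the midpoint is an assumption.

```python
def select_peaks(peaks, per_half=20):
    """Keep the per_half most intense peaks in each half of the m/z range.

    peaks is a list of (mz, intensity) tuples. Assumption: a peak exactly
    at the midpoint of the range is assigned to the lower half.
    """
    if not peaks:
        return []
    lowest = min(mz for mz, _ in peaks)
    highest = max(mz for mz, _ in peaks)
    midpoint = (lowest + highest) / 2.0
    lower = [p for p in peaks if p[0] <= midpoint]
    upper = [p for p in peaks if p[0] > midpoint]
    kept = []
    for half in (lower, upper):
        # Sort by intensity, most intense first, and truncate.
        kept.extend(sorted(half, key=lambda p: p[1], reverse=True)[:per_half])
    return sorted(kept)  # return in m/z order


# 100 synthetic peaks with intensity rising with m/z (made-up data).
spectrum = [(100 + i, i) for i in range(100)]
selected = select_peaks(spectrum)
print(len(selected))
```

Raising `per_half` corresponds to increasing the maximum number of MS/MS peaks, for example for long peptides or for site assignment of modifications.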

Modifications selected here will be searched both as if the modification is present or absent. To select multiple modifications hold down the 'Ctrl' ( '⌘' on a Macintosh ) keys and click the modifications you would like to add. Similarly, to deselect modifications, hold down 'Ctrl' or '⌘' and click on the modification name.

MS-Tag/Batch-Tag first use a precursor mass filter to identify candidate peptides. If variable modifications are also selected then a match will be to a given peptide with a given combination of modifications. Sometimes there are multiple amino acid sites on which these modifications could be located and Protein Prospector considers each permutation of these sites separately. When looking for certain modifications with highly charged data then the number of sites can get very large indeed with a dramatic impact on the search speed. One modification where this is a particular problem is phosphorylation.

If the number of permutations of variable modifications in a peptide exceeds the value of the Max Peptide Permutations parameter then the peptide is skipped in the search and won't appear in the results.

If this parameter is left blank then all permutations are considered.
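To see how quickly the number of site permutations grows, the count can be estimated with binomial coefficients. This is a sketch under simplifying assumptions: each modification type is assumed to have its own, non-overlapping set of candidate residues, and the function names are hypothetical rather than part of Protein Prospector.

```python
from math import comb


def count_site_permutations(mods):
    """Count the ways of placing modifications on a peptide.

    mods is a list of (candidate_sites, copies_present) tuples, one per
    modification type. Assumption: the candidate residues of different
    modification types do not overlap, so the counts simply multiply.
    """
    total = 1
    for candidate_sites, copies_present in mods:
        total *= comb(candidate_sites, copies_present)
    return total


def is_skipped(mods, max_permutations=None):
    """Mimic the Max Peptide Permutations rule; None means 'left blank'."""
    if max_permutations is None:
        return False
    return count_site_permutations(mods) > max_permutations


# Three phosphates on a peptide with eight S/T/Y residues.
print(count_site_permutations([(8, 3)]))
print(is_skipped([(8, 3)], max_permutations=50))
```

Here the triply phosphorylated peptide has 56 possible site arrangements, so a limit of 50 would cause it to be skipped.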

This maximum restriction on the number of variable modifications per peptide can significantly speed up database searches.

N.B. This is a function that should only be used when searching a very restricted list of proteins. We recommend only using this on a list of accession numbers of proteins already identified in the sample from an initial search.

It is possible to search for unanticipated mass modifications on any or all amino acids. A mass range for modifications is specified. A given mass modification is unlikely to be an exact integer change, so the Defect option defines an adjustment to the nominal mass shift that allows the user to still search with reasonable mass tolerance restrictions (although we recommend employing less restrictive parent and fragment tolerance restrictions than normal when analyzing data that has high mass accuracy, such as TOF, FT or Orbitrap data). The neutral loss option will look for modifications that are immediately lost upon fragmentation; i.e. it will assume there is no modification when matching fragment ions. This can be useful for identifying labile modifications on peptides such as O-GlcNAcylation or sulfation. For neutral loss modifications, no modification is reported on the peptide sequence, but is indicated by the mass error on the precursor mass.

If the Uncleaved checkbox is selected then mass modifications are not considered at digest cleavage sites.

Mass modifications to the N and C terminus of a peptide can be restricted to peptides at the N or C terminus of the protein by using the appropriate menu selection.

If the Rare checkbox is selected then peptides containing mass modifications cannot occur in a peptide with a modification which also has a Rare specifier. This option could be useful for a glycosylation search where known glycosylations would be reported in preference to unknown ones.

The Acc # option can be used to limit the mass modification search to particular accession numbers and optionally sites (i.e. residue numbers) within an accession number. Using this option would also limit the possible hits in a crosslink search.

For example to limit the search to the SwissProt entries Q58DW5 and P46777 you should enter:

Q58DW5
P46777

If you are searching a User Protein then use the index number of the protein as the accession number. Eg:

43
19
42

This would search the 43rd, 19th and 42nd protein. As you can see the numbers don't need to be in order.

To limit the search to a set of sites in a given accession number then these are specified after the accession number in a space separated list. Eg:

43 178 188 220 228
19
42

If you are searching multiple databases (eg SwissProt and User Protein) the format is as shown below:

>SwissProt.2016.9.6
Q58DW5 89 242
P46777 89 242
>User Protein
43 89 242
19 149
42 142

If you were searching, say, the SwissProt.2016.9.6.random.concat database you should still just enter SwissProt.2016.9.6. The random version of the accession number will then also be searched.
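The Acc # list formats shown above are simple enough to check with a few lines of code before pasting them into the form. The parser below is only a sketch of the format as described here; the function name and the returned structure are hypothetical, and entries given without a database header are stored under the key None.

```python
def parse_acc_list(text):
    """Parse an Acc # list into {database: {accession: [sites]}}.

    Lines starting with '>' name a database. Every other non-blank line is
    an accession number optionally followed by space separated residue
    numbers. Accessions listed before any '>' line are stored under None.
    """
    result = {}
    current_db = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        if line.startswith(">"):
            current_db = line[1:].strip()
            result.setdefault(current_db, {})
            continue
        fields = line.split()
        sites = [int(site) for site in fields[1:]]
        result.setdefault(current_db, {})[fields[0]] = sites
    return result


example = """>SwissProt.2016.9.6
Q58DW5 89 242
P46777 89 242
>User Protein
43 89 242
19 149
42 142
"""
parsed = parse_acc_list(example)
print(parsed["User Protein"]["19"])
```

An accession number with no sites after it maps to an empty list, which corresponds to searching the whole protein.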

Expectation values reported for this type of search are likely to be inaccurate. Hence, we recommend performing this type of search against a concatenated database, so that it is possible to determine a suitable acceptance threshold.

In the Limit Crosslinks section the Allow item can be used to limit the crosslinks to either Intraprotein or Interprotein crosslinks. One reason for doing this is that an intraprotein crosslink search will be much more sensitive than an interprotein search as just under half the crosslinks in a typical dataset will be intraprotein. Thus the acceptable score threshold for the two searches will be different.

You can also limit the search to particular pairs of accession numbers by using the Acc # Pairs option. This requires a list of space separated accession number pairs. Intraprotein pairs would specify the same accession number twice. If you have a User Protein the accession number is the index number (ie the ith User Protein). Example:

P63244 P60866
P63244 Q8NC51
P61247 P61247
Q00610 Q00610
P13639 P13639
P39023 P39023
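A list like the one above can be sanity checked and split into intraprotein and interprotein pairs with a short script. This is a hypothetical helper for preparing the input, not something Protein Prospector provides.

```python
def classify_pairs(text):
    """Return (acc1, acc2, kind) for each line of an Acc # Pairs list.

    kind is 'intraprotein' when both accession numbers are the same and
    'interprotein' otherwise. Blank lines are ignored.
    """
    pairs = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        fields = line.split()
        if not fields:
            continue
        if len(fields) != 2:
            raise ValueError(f"line {line_number}: expected two accession numbers")
        first, second = fields
        kind = "intraprotein" if first == second else "interprotein"
        pairs.append((first, second, kind))
    return pairs


pairs = classify_pairs("P63244 P60866\nP61247 P61247\n")
for first, second, kind in pairs:
    print(first, second, kind)
```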

Note that there is also an Acc # option in the Mass Modification section. You probably want to include any accession numbers from the Acc # Pairs option in there as well. Otherwise a mass modification search will still be performed on proteins not specified in the Acc # Pairs option; they just won't be paired into crosslinks.

The Best Peptide Max Rank option will limit the rank of the highest scoring peptide to that specified. Thus if you were to specify 10 then one of the peptides would have to be the 10th best scoring peptide or better. The other peptide can still have a rank down to the value specified in the # Saved Tag Hits field. If the Best Peptide Max Rank option is left blank then both peptides can have a rank down to that specified in the # Saved Tag Hits field.

These are combinations of modifications that can be searched at once, such as any amino acid substitution (Homology) or amino acid substitutions that would result from single base changes in the genome.

It is possible to create named sets of filtered spectra using either MS-Filter or Search Compare. One or more of the sets can be combined to limit a Batch-Tag search to only the selected spectra. If you have created one or more sets of spectra for a particular project then items will appear on the Batch-Tag form to allow you to select them.

Filtered Spectra is a multiple choice item allowing you to select one or more sets of filtered spectra.

Combine Filtered Spectra Sets can be set to Union, Intersection or Difference and dictates how the sets are combined.

Exclude Filtered Spectra selects all the spectra that are not in the combined list.
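The three combination modes and the exclude option map directly onto set operations. The sketch below uses hypothetical function names; in particular, treating Difference for more than two sets as "the first set minus all the others" is an assumption, as the behaviour is not specified here.

```python
def combine_sets(spectra_sets, mode="Union"):
    """Combine sets of spectrum index numbers.

    Assumption: for Difference with more than two sets, every later set is
    subtracted from the first one.
    """
    if not spectra_sets:
        return set()
    first, rest = set(spectra_sets[0]), spectra_sets[1:]
    if mode == "Union":
        return first.union(*rest)
    if mode == "Intersection":
        return first.intersection(*rest)
    if mode == "Difference":
        return first.difference(*rest)
    raise ValueError(f"unknown mode: {mode}")


def apply_exclude(all_spectra, combined, exclude=False):
    """With exclude set, select the spectra that are NOT in the combined list."""
    return set(all_spectra) - combined if exclude else set(combined)


charge3 = {1, 2, 3, 4}        # e.g. spectra with charge 3 precursors
has_204 = {3, 4, 5}           # e.g. spectra with a 204 peak
both = combine_sets([charge3, has_204], "Intersection")
print(sorted(both))
print(sorted(apply_exclude(range(1, 7), both, exclude=True)))
```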

A set of spectra from MS-Filter may be created at the point of selecting a project for searching. When selecting the project check the MS-Filter checkbox along with the selected project. When you click Select Project this will take you to an MS-Filter page. Select the required MS-Filter parameters and enter a name for the spectra set. This needs to be unique for a given project. For example you could select just the spectra with charge 3 precursors. Or you could select just the spectra with a 204 peak.

A set of spectra from Search Compare may be created by selecting Filter Spectra Lists from the Format option at the top of the page. You need to enter a name into the Filtered Spectra List Name box to name your filtered spectra set. There is also an Unmatched checkbox to select the spectra not matched. Click on Submit to create the set. A typical use of a Search Compare created spectra set would be to create a set of unmatched spectra to do a second search on. Eg you could search for linear peptides and then do a second search for crosslinked peptides that would eliminate the linear peptides. It is best to select a Time report when creating a set of spectra from Search Compare as then all matching spectra are included.

A set of filtered spectra is stored in a file with a .idx extension. It simply contains a list of index numbers to the selected spectra. The index numbers can span multiple peak list files. The file is a binary file consisting of a sorted list of 8 byte unsigned integers.
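Given that description, an .idx file could be read or written with Python's struct module roughly as follows. The byte order is not stated above, so little-endian is assumed here and exposed as a parameter; the function names are hypothetical.

```python
import struct


def read_idx(data, byte_order="<"):
    """Decode the contents of an .idx file into a list of index numbers.

    Assumption: the 8 byte unsigned integers are little-endian ('<');
    pass '>' if the file turns out to be big-endian.
    """
    if len(data) % 8 != 0:
        raise ValueError("length is not a multiple of 8 bytes")
    count = len(data) // 8
    return list(struct.unpack(f"{byte_order}{count}Q", data))


def write_idx(indices, byte_order="<"):
    """Encode index numbers as a sorted list of 8 byte unsigned integers."""
    ordered = sorted(indices)
    return struct.pack(f"{byte_order}{len(ordered)}Q", *ordered)


blob = write_idx([5, 1, 300])
print(len(blob))
print(read_idx(blob))
```

To read a real file, pass the result of `open(path, "rb").read()` to `read_idx`.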

The filtered spectra sets can be deleted using Results Management.

Batch-Tag accepts peak list data in the form of lists containing m/z, intensity and charge. It will accept peak lists in mgf, dta, pkl, mzXML and mzData file formats. There is no need to specify the file format; Protein Prospector will try to recognize it automatically. Compressed files in zip, 7z, rar, gz, z, bz2 and cmn formats are supported along with tar archives and compressed tar archives (.tar, .tgz, .tar.gz, .taz, .tar.z). These compressed files may contain many peak list files, allowing the searching of multiple files in one search. The compressed file can also contain raw data from Xcalibur (Thermo), Analyst (ABI) or 4700 Explorer (ABI), if you want to do quantitation analysis. If you want to upload raw data for quantitation analysis, then the raw data file and the corresponding peak list file must have the same name (apart from the filetype suffix). The files must be combined together into a single zip or gz file for upload through Batch-Tag Web.

Protein Prospector is able to determine charge states and de-isotope fragment ions in peak lists. This can be very beneficial when the data is of sufficient resolution to reliably determine the charge state; e.g. TOF or FT-ICR data. It means that instead of creating an imaginary peak type that has half the m/z of expected fragment ions to look for doubly-charged peaks, it can directly identify the multiply charged peaks. This significantly reduces the chances of random peak matches. Hence, we recommend submitting peak lists that have not been de-isotoped, because if the isotopes have been removed it is impossible to determine the charge state. Of course, for low resolution data, such as ion trap data, charge state determination is usually not possible so the ion type at half m/z has to be created to try to find doubly-charged fragment ions.

Protein Prospector will assign a project name for your uploaded data based on the name of the file that you upload (without the file name suffix). Thus if the file you upload is called data1.mgf then the project will be called data1. If the file you upload is called data2.zip the project will be called data2. The project name is important as it is the name used by the Search Compare, Batch-Tag and Results Management programs to access the data set. Because project names need to be unique for a given user, the upload filenames also need to be unique. The maximum length of a project name is 58 characters.
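If you are preparing many files for upload, the naming rule can be checked in advance with a small script. This is a sketch with hypothetical names: it assumes only the final suffix is removed, and it raises an error for names over 58 characters simply as a warning to the user, since the server's actual behaviour in that case is not described here.

```python
import os

MAX_PROJECT_NAME_LENGTH = 58  # limit stated in this manual


def project_name_from_upload(filename):
    """Return the project name implied by an upload file name.

    Assumptions: only the last suffix is stripped (how multi-part suffixes
    such as .tar.gz are treated is not documented here), and an over-long
    name is reported as an error by this helper.
    """
    base = os.path.basename(filename)
    name, _suffix = os.path.splitext(base)
    if len(name) > MAX_PROJECT_NAME_LENGTH:
        raise ValueError(f"project name '{name}' exceeds {MAX_PROJECT_NAME_LENGTH} characters")
    return name


for upload in ("data1.mgf", "data2.zip"):
    print(upload, "->", project_name_from_upload(upload))
```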

If your peak lists are in a format that Batch-Tag currently does not accept then e-mail us a small sample file and we will try to incorporate your file format.

When the search is executed a job status page will be brought up that will report the progress of the search. For a new project Protein Prospector will perform two searches; one for calculating score distributions for expectation value calculation and one for identifying the peptides. Hence, when the progress reaches 100% it will start again at 0% and the search will finish when it reaches 100% for a second time. If you perform a second analysis on a dataset, providing you do not change any parameters that will change the score distribution, it will only need to perform a single search for subsequent analyses in this project.

A search daemon manages search submissions. If more searches are submitted than can be performed at one time they will be lined up and as soon as a search is completed the next one is started. Searches are performed in the order they are submitted. Searches continue and complete independently of whether the web browser is still open.


In this section the report type and how it is presented is defined.

Results can either be displayed in HTML format or in a tab-delimited format. The tab-delimited format can be easily copied into a spreadsheet.

The Filtered Peak List option creates mgf peak list files containing just the spectra in either a peptide or time report.

The MS-Viewer Files option creates both the peak list and results files required by MS-Viewer.

The pepXML Filtered option creates a pepXML file which contains the results after Search Compare filtering.

The pepXML Unfiltered option creates a pepXML file which contains the results produced by the Batch-Tag program. That is, they are unfiltered by any Search Compare options.

It is possible to filter the results to only view results for proteins on a list of accession numbers. Conversely, one can remove selected accessions from the list by checking the 'Remove' button.

If multiple databases are involved then you need to specify the databases that the accession numbers are from as shown below:

>NCBInr.2011.10.06
30377
13537138
>SwissProt.2011.10.06
Q99456
Q2M2I5

This should be highlighted when the peak list searched includes data from several different samples; e.g. MS/MS spectra acquired on a TOF-TOF from several unrelated spots. This will produce a separate summary for each spot rather than combining the results from all spectra. Similarly, if multiple peaklists were submitted for one search, this will split the results by peaklist. It is also possible to filter the results to only show results from one spot or one peaklist using the Spot/Fraction field.

Often if you do a database search then some of your hits will not be unique to a particular species and the species entry which is displayed can be somewhat random. The preferred species option allows you to enter strings from either the NCBI taxonomy browser or the Swiss Prot controlled species vocabulary to describe the mix of species in your sample. The software will then find the nearest match in the taxonomy tree from the matches that you have, as long as the matches are equivalent. For example, if your sample was a mixture of yeast, E. coli and human proteins you could enter:

YEAST
ECOLI
HUMAN

If a peptide then happened to match both a YEAST and an ECOLI protein, the YEAST one would be preferentially displayed.

FDR Filtering is available if:

1). The search has been done against a database with a concatenated random or reverse database.

2). Expectation values are available.

3). The data is not crosslink data.

FDR filtering attempts to calculate an e-value limit at which the percentage of decoy hits falls below a specified limit. If you choose the option FDR limits only then the score limits for the other score thresholds (see below) are the hardcoded values contained in the file expectation.txt. These are typically very liberal thresholds to permit sufficient false positives. The option Score and FDR limits allows the user to set the other score thresholds. If you use this setting there may not be sufficient false positives to achieve the requisite false positive rate. The Score Limits Only option bypasses any FDR filter.
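The idea of deriving an e-value limit from decoy hits can be illustrated with a short sketch. This is not Search Compare's algorithm: the function name is hypothetical, and the FDR is estimated here as the ratio of decoy hits to target hits among the accepted matches, which is one common convention and an assumption on our part.

```python
def evalue_cutoff_for_fdr(hits, max_fdr=0.01):
    """Return the largest e-value cutoff whose accepted hits meet max_fdr.

    hits is an iterable of (evalue, is_decoy) tuples. Assumption: the FDR
    is estimated as decoy hits / target hits at or below the cutoff.
    Returns None if no cutoff satisfies the limit.
    """
    best_cutoff = None
    targets = decoys = 0
    # Sort by e-value; at equal e-values count decoys first to stay conservative.
    for evalue, is_decoy in sorted(hits, key=lambda h: (h[0], not h[1])):
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        if targets and decoys / targets <= max_fdr:
            best_cutoff = evalue
    return best_cutoff


# Made-up hits: (e-value, matched the decoy half of the database?)
hits = [(1e-6, False), (1e-5, False), (1e-4, False),
        (1e-3, True), (1e-2, False), (0.1, True)]
print(evalue_cutoff_for_fdr(hits, max_fdr=0.25))
```

If the other score thresholds have already removed most decoy hits, such a procedure has few false positives to work with, which is why liberal thresholds are used with the FDR limits only option.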

It is possible to set a Protein and/or a Peptide FDR. The Peptide FDR can be set to Peptide, Peptide and Charge or Spectral. The Peptide setting will only consider the hit for a given peptide/modification combination with the best e-value. The Peptide and Charge will only consider the hits for a given peptide/modification/precursor charge with the best e-value. The Spectral setting considers all hits. Note that only the best scoring hit for a given spectrum is considered.
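The three Peptide FDR settings differ only in how hits are grouped before the best e-value in each group is kept. A minimal sketch, with a hypothetical function name and a hypothetical dictionary layout for the hits:

```python
def collapse_hits(hits, level="Peptide"):
    """Keep the hits that would be considered at a given Peptide FDR setting.

    hits is a list of dicts with keys 'peptide', 'mods', 'charge' and
    'evalue' (a hypothetical layout used only for this illustration).
    """
    if level == "Spectral":
        return list(hits)  # every hit is considered
    if level not in ("Peptide", "Peptide and Charge"):
        raise ValueError(f"unknown level: {level}")
    best = {}
    for hit in hits:
        key = (hit["peptide"], hit["mods"])
        if level == "Peptide and Charge":
            key += (hit["charge"],)
        # Keep only the hit with the best (lowest) e-value per group.
        if key not in best or hit["evalue"] < best[key]["evalue"]:
            best[key] = hit
    return list(best.values())


hits = [
    {"peptide": "SAMPLER", "mods": "", "charge": 2, "evalue": 1e-4},
    {"peptide": "SAMPLER", "mods": "", "charge": 3, "evalue": 1e-6},
    {"peptide": "SAMPLER", "mods": "Phospho@1", "charge": 2, "evalue": 1e-3},
]
for level in ("Peptide", "Peptide and Charge", "Spectral"):
    print(level, len(collapse_hits(hits, level)))
```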

The discriminant score is a combination of two measures of the search result. One is the expectation value for the peptide match (a measure of the likelihood that a match is random) and the other is a 'best peptide score', which takes into account the fact that if a protein has been confidently identified in a sample, it is more likely that other peptides will be identified from the same protein. As a default this threshold is set to 0 and, for a normal database search, discriminant scores below 0 are generally incorrect, whilst those above 0 are mostly correct. At the top of the search results is a plot of all the discriminant scores. This should contain two distributions: one for incorrect (random) matches and one for correct.

Batch-Tag uses a simple scoring scheme based on a certain score for each ion type. These two parameters set a minimum quality standard for a spectrum to be accepted. The minimum protein score is typically set higher than the minimum peptide score to require a higher standard if a protein is going to be identified on the basis of a single peptide.

An expectation value is a measure of how many times an event is expected to happen at random. An expectation value of 0.1 means that if the search was repeated ten times you would expect one random match.

For a protein to be reported, it has to have a peptide with an expectation value less than the Max EValue Protein threshold. Other peptides from the reported proteins are reported as long as their expectation values are less than the Max EValue peptide threshold.
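The interaction of the two thresholds can be expressed in a few lines. This is only a sketch of the rule as stated above; the function name and the default threshold values are hypothetical.

```python
def report_peptides(peptide_evalues, max_evalue_protein=0.01, max_evalue_peptide=0.05):
    """Return the peptide e-values that would be reported for one protein.

    The protein is reported only if at least one peptide beats the protein
    threshold; its peptides are then listed if they beat the peptide
    threshold. The default thresholds are illustrative values only.
    """
    if not any(e < max_evalue_protein for e in peptide_evalues):
        return []  # protein not reported at all
    return [e for e in peptide_evalues if e < max_evalue_peptide]


print(report_peptides([0.001, 0.03, 0.2]))
print(report_peptides([0.03, 0.04]))
```

In the second call both peptides pass the peptide threshold, but neither passes the protein threshold, so nothing is reported for that protein.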

If Best Peptide Only is selected Search Compare will only report the best match if the same peptide has been matched multiple times. The Best Per Charge option will retain the best match for each charge state. Keep Replicate Peps will report redundant peptide identifications.

For each spectrum the top five scoring matches are saved, but only the one with the best Discriminant score is reported. By deselecting this parameter Search Compare can report results other than the top Discriminant scoring match; i.e. it could potentially report more than one peptide to a given spectrum.

If you select this parameter a graph of the discriminant scores is plotted near the top of the report. This can be useful for selecting an appropriate value for the min best discriminant score parameter. Generally you can leave this selected. However if you are doing a very large quantitation analysis (say a hundred fractions or more) you can significantly reduce the amount of memory used by the program by deselecting this. Thus you could do a report without quantitation analysis to establish the discriminant score limit and then deselect this option before doing the quantitation analysis.

The peptide composition option is used along with the composition checkbox to display a column in either the peptide or time report. This column will contain a 1 if the peptide matches the composition options or a 0 if it doesn't. The composition options include amino acids, modifications or mass modifications. Mass modifications should be entered as integers (one per line). If multiple composition options are selected then AND means all the selections have to be present and OR means just one of the selections has to be present. This feature is particularly useful for sorting results in a spreadsheet; e.g. grouping all phosphorylated peptides together.

Three options: a protein level report, peptide level report or a time report. The time report lists every MSMS spectrum acquired in order of acquisition, with the match (if there was one). Those spectra for which there was no match will just report the number of peaks submitted.

This defines how to deal with homologous proteins. The default is 'interesting' which means a homologous protein will only be reported if there is at least one unique peptide matching to the protein. Occasionally proteins will be reported as homologous when the level of homology may be fairly low (e.g. only two out of ten peptides are identical between proteins).

The Separated/Merged menu is shown if multiple search results are selected. If you select Separated then the results for each search have their own set of columns in the report. If you select Merged then the results are combined together into a single set of columns. This is useful, for example, for combining CID and ETD results into a single report.

Defines the parameter that you want to use to order the results. The default is 'Discriminant Score' which we believe is the best way to order the results in terms of reliability. Peptide lists can also be sorted by score, start residue in the protein sequence, elution time/spot number or m/z.

There is a great variety in the quality of peak picking by software when producing peaklists to be submitted for database searching, with some peaklists being essentially an unprocessed list of hundreds of masses, most of which are noise. Protein Prospector takes the mass range of the peaks submitted, splits it in half, then for most instruments takes only the top 20 peaks in each half of the spectrum to produce a list of 40 peak masses that are used for database searching. 40 peaks has been found to be in most cases the optimal number for tryptic peptides to give specific answers without introducing too many false matches to noise peaks. However, in some cases there may be a benefit to increasing or decreasing this number. For example, if peptides are generally long (maybe an enzyme was used that cuts infrequently) then there will be more real fragments in the spectrum, so increasing this value may be beneficial. Also, if you are analyzing post-translationally modified peptides and want to identify sites of modification, then allowing for more peaks that may assist in site assignment may be important.

If this item is set to Max MSMS Peaks then MS-Product presentations of peptide assignments will only display the processed peaklist that was used in the database searching for making the peptide assignment; i.e. the de-isotoped peaklist with either the default 40 or some other number of peaks. Selecting Unprocessed MSMS will tell MS-Product to display the raw peaklist as it was submitted. The Max MSMS Peaks/100 Da setting will select the maximum n peaks for each 100 Da range. See the MS-Product manual for more details. If either Max MSMS Peaks or Max MSMS Peaks/100 Da are selected the parameter settings are also passed to MS-Tag From File.
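The Max MSMS Peaks/100 Da setting can be pictured as keeping the n most intense peaks in each 100 Da window. The sketch below is an illustration only: the function name is hypothetical, and aligning the windows to multiples of 100 (0-100, 100-200, ...) is an assumption; the MS-Product manual describes the actual behaviour.

```python
from collections import defaultdict


def select_peaks_per_100_da(peaks, max_per_window=4):
    """Keep at most max_per_window of the most intense peaks per 100 Da window.

    peaks is a list of (mz, intensity) tuples. Assumption: windows start at
    multiples of 100.
    """
    windows = defaultdict(list)
    for mz, intensity in peaks:
        windows[int(mz // 100)].append((mz, intensity))
    kept = []
    for window_peaks in windows.values():
        ranked = sorted(window_peaks, key=lambda p: p[1], reverse=True)
        kept.extend(ranked[:max_per_window])
    return sorted(kept)  # return in m/z order


# 20 synthetic peaks between 100 and 290 (made-up data).
peaks = [(100.0 + 10 * i, float(i)) for i in range(20)]
print(select_peaks_per_100_da(peaks, max_per_window=4))
```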

This parameter is passed through to MS-Tag From File and controls the number of hits displayed. For certain applications, such as looking for cross-linked peptides it is often useful to save more than the default 5 hits.

Time reports can also include the spectra that haven't been matched by checking the Unmatched Spectra box. This allows you to search the spectra individually using MS-Tag from File. This option should obviously be used with care if your data set has a lot of spectra.

This section defines what columns you want displayed in your results summary. Some of these columns are specific to protein reports and some to peptide reports. Most of the column names are self-explanatory, but below is a definition of some of the less obvious.

Error between observed and theoretical precursor mass. Units will be the same as those used for the search; i.e. will be parts per million if the Parent MS/MS tolerance was defined in ppm.

Number of submitted MS/MS fragment peak masses that are not explained by the peptide match.

Whether this match was the top match, 2nd best match, etc. to the particular spectrum, prior to the discriminant scoring.

Score assigned by Batch Tag where points are given for every peak matched to a theoretical fragment ion from a peptide. Different numbers of points are given depending on the ion type and the instrument. For a more complete explanation see Chalkley, R.J. et al.

Difference in Batch Tag score between this match and the sixth best match to this spectrum (the sixth match is assumed to be an incorrect, random match). This is a crude measure of how much better than random a given match is.

Report the expectation value.

The p value is the probability that the match could have occurred by chance. It is the expectation value divided by the number of trials. In the case of matching MSMS spectra the number of trials is the number of peptides selected by the precursor mass filter. The p-value is independent of the database search and is a direct measure of how good the match to a spectrum is.

The negative log of the P-value multiplied by 10. This transformation is sometimes used to linearize p-values and to give better matches higher scores.
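
As a worked illustration of the two definitions above, the following sketch derives a p-value from an expectation value and then converts it to a -10logP score. The function names are hypothetical, and the use of a base 10 logarithm is an assumption, since the text only says 'negative log'.

```python
import math


def p_value(expectation, num_precursor_matches):
    """P-value = expectation value / number of trials.

    The number of trials is the number of database peptides selected
    by the precursor mass filter.
    """
    return expectation / num_precursor_matches


def minus_ten_log_p(p):
    """-10 * log10(p); base 10 is assumed here."""
    return -10.0 * math.log10(p)
```

For example, an expectation value of 0.5 with 1000 peptides passing the precursor mass filter gives a p-value of 0.0005, and better (smaller) p-values give higher -10logP scores.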

The number of peptides from the database that match the precursor ion's m/z value.

The expectation value calculation makes use of a survival curve of the top 10% of scores of peptides from the database that match the precursor m/z. This is the gradient of this curve.

The expectation value calculation makes use of a survival curve of the top 10% of scores of peptides from the database that match the precursor m/z. This is the offset of this curve.

A more reliable scoring system than the Batch Tag score. It is calculated by combining 'Best Peptide Score' and Expectation value.

Reports how many protein entries in the database contain this same peptide sequence (allowing for substitutions of Leu and Ile).

Sum of Batch Tag peptide scores.

Number of different peptides matched to a given protein. Multiple matches of the same peptide at the same or different charge states are defined as a single unique match, but modified versions of the same peptide (e.g. methionine oxidized version) count as separate unique matches.

This is the number of peptides reported for a given protein whether unique or not. This is only different from the number of unique peptides if you have the Keep Replicate Peptides option selected. Some basic quantitation methods make use of peptide counts.
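
The difference between the two counts can be illustrated with a short sketch. The (sequence, modifications, charge) tuple layout and the function name are assumptions made for the example; charge is ignored when counting unique peptides, while a modified version of a peptide counts separately.

```python
def count_peptides(matches):
    """Return (unique_peptides, total_peptides) for one protein.

    matches is a list of (sequence, mods, charge) tuples (assumed layout).
    """
    # Same sequence and mods at a different charge state is the same
    # unique peptide; a differently modified version is a separate one.
    unique = {(seq, mods) for seq, mods, _charge in matches}
    return len(unique), len(matches)
```

With one peptide seen at charge 2 and charge 3, plus its methionine oxidized version, this returns 2 unique peptides out of 3 reported peptides.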

Batch Tag score of the highest scoring peptide matched to a given protein. This is one of the parameters used for the discriminant scoring.

This is the parameter by which a Protein report is sorted. It is a sum of the discriminant score for distinct peptides. The discriminant score chosen is the one with the best expectation value. All the summed scores must also be from different spectra.

Highest discriminant scoring peptide to a given protein.

The peptide column displays the database peptide. The mod reporting option controls how constant and variable mods are reported. These can optionally be reported in separate columns or together in the same column. If there is any ambiguity in the peptide hit then the variable modification reporting will contain either a SLIP score or alternative modification positions. Alternative modification positions are displayed if the SLIP score is below the SLIP score threshold. Some examples:

Oxidation@6

means that the residue at the 6th amino acid in the peptide is oxidized and there is no alternative site for the oxidation based on the search parameters.

Phospho@1;Oxidation@6

means that both residues 1 and 6 are modified and there are no alternative sites.

Phospho@2=5;Oxidation@6

Here the Phosphorylation at residue 2 has a SLIP score of 5 so there is an alternative modification location. If the SLIP score threshold were increased to 6 then this would be displayed. Eg:

Phospho@2|4;Oxidation@6

Sometimes the ambiguity is more complex. Eg:

Phospho@1&2|1&3|2&3;Oxidation@6

Phospho@2&10&12|6&10&12|6&10&15|6&12&15

Phospho@1&Oxidation@6|Phospho@2&Oxidation@9

HexNAc&Phospho&Phospho@(17&13&(20|21|23|24|25|27|28))|(20&(9&24)|(13&(17|21|23|24|25|27|28)))|(21&13&(17|20|23|24|25|27|28))|(23&13&(17|20|21|24|25|27|28))|(25&13&(17|20|21|23|24|27|28))|(27&13&(17|20|21|23|24|25|28))|(28&13&(17|20|21|23|24|25|27))

The following string describes the case where no modification is one of the options. The first string, Label:15N(1)+Oxidation@1 says that the first residue is unambiguously 15N labelled and oxidized. The second string, |Label:15N(1)+Deamidated@4|13, says that either there is no modification or that the 4th or 13th residue is 15N labelled and deamidated. This ambiguity is caused by the fact that the mass shift for 15N labelled deamidation is only 0.0131 Da.

Label:15N(1)+Oxidation@1;|Label:15N(1)+Deamidated@4|13

The || symbol is used to symbolise "or" for fundamentally different explanations. Eg:

430.2264&Oxidation@(2&1)||446.2213@2

This means the hit is either:

430.2264@2 and Oxidation@1

or

446.2213@2

In this case a different number of mods are involved in the alternative explanations.

A more complicated example is:

1519.6206&Xlink:DSS1:2H(12)@((7&12)|(12&7))||1529.6803&Xlink:DSS1@((7&12)|(12&7))||1530.6643&Xlink:DSS2@((7&12)|(12&7))||1685.7590@7|12

Splitting this by || gives:

1519.6206&Xlink:DSS1:2H(12)@((7&12)|(12&7))

or

1529.6803&Xlink:DSS1@((7&12)|(12&7))

or

1530.6643&Xlink:DSS2@((7&12)|(12&7))

or

1685.7590@7|12
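
If you post-process exported reports, the notation above can be handled with simple string operations. The sketch below is a minimal illustration, not a full parser: split_alternatives splits on the || symbol, and parse_simple only accepts the unambiguous Name@position and Name@position=SLIP forms. Both function names are hypothetical. The SLIP score is kept as text so that integer and decimal values can still be told apart.

```python
import re

# Name@position with an optional =SLIP suffix, e.g. Phospho@2=5
_SIMPLE = re.compile(r"([^@;|&]+)@(\d+)(?:=(\d+(?:\.\d+)?))?")


def split_alternatives(mod_string):
    """Split a variable modification string into its || alternatives."""
    return mod_string.split("||")


def parse_simple(mod_string):
    """Parse entries such as 'Phospho@2=5;Oxidation@6'.

    Returns a list of (name, position, slip) tuples, where slip is the
    SLIP score as text or None. Ambiguous forms using | or & are
    rejected with ValueError.
    """
    parsed = []
    for item in mod_string.split(";"):
        match = _SIMPLE.fullmatch(item)
        if match is None:
            raise ValueError(f"not a simple Name@position entry: {item!r}")
        name, position, slip = match.groups()
        parsed.append((name, int(position), slip))
    return parsed
```

Applied to the cross-link example above, split_alternatives returns the four alternatives listed.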

If you click on the peptide sequence an MS-Product report of the match will be displayed. Multiple peptides are displayed if there is any ambiguity in the hit.

SLIP scores are usually differences in -10logP scores between the peptide with the modification in the given position and a lower scoring peptide with the modification in a different position. If the Expectation Calc Limit is set to None or an expectation value can't be calculated for a given spectrum then the SLIP score reported is the score difference. If the score difference is reported then the SLIP score will have a decimal point. SLIP scores calculated from P-values are reported as integers.

Note that as it is possible to search both constant and variable modifications on a given amino acid, the ambiguity reported in the variable modifications column can occasionally have modifications of different masses on a given residue.

If protein mods are selected then the positions of the variable mods are reported relative to the start of the protein rather than the peptide. This can be useful if there are multiple hits spanning the same part of the protein sequence.

If you have selected mass modifications in your peptide this column will contain the integer mass modification. A mass modification histogram will also be displayed near the top of the report.

Spectrum number: This will report four columns: Fraction = peak list file peptide was identified from (somewhat redundant option in current public version as only one file can be searched at once); RT = LC retention time / Spot number; R = run number (only relevant to TOF-TOF data); # = MS/MS acquisition number on this spot/retention time. For LC-MS/MS data, if the software that created the peak list created duplicate peak lists for a given spectrum with different charge states, then this will be reflected in the # column.

Number is the protein rank within the results. For proteins that are homologous (interesting: see above) to another they will be ranked the same; e.g. 5-2 means that this protein is homologous to protein hit 5, but is the second best match within this set of proteins.

Uniprot IDs are currently just reported for databases from Uniprot. There is some information on them here.

Gene names are currently just reported for databases from Uniprot. There is some information on them here.

The number of amino acids in the protein.

Deselecting this will mean that all links within the report; e.g. clicking on a protein in the protein summary to get a peptide summary for that protein, will be disabled. This significantly increases the speed in which the page is displayed.

If you select checkboxes then the report will contain a column of checkboxes which can be used to select or eliminate results from the report. For example you might want to manually look at the quantitation results and eliminate obvious outliers. Another possibility would be to just include peptides in the report with particular modifications. For example you could just include phosphorylated peptides in the report. A small form is displayed at the top of the report to facilitate this.

If the user has uploaded raw data files from one of the supported data types (Thermo .raw; ABI .wiff or ABI .t2d) along with the peaklists, then Search Compare can read the raw data to extract quantitative information. For details about how to upload raw data for quantitation see here. If ‘Raw Type’ of MS Precursor is selected it will report information about the precursor peak; whereas if ‘Raw Type’ of Quantitation is selected it will calculate quantitation assuming isotopic labeling with the option selected in the ‘Quantitation’ pull-down menu.

Search Compare can calculate protein-level median, interquartile range (IQR) (the values that bracket the middle 50% of measurements), mean, +/- any number of standard deviations from the mean and how many peptides were used for quantitation (num). Intensities or peak areas can be used, and a minimum intensity or area threshold may be applied for a peak to be used for quantitation. For MS-based quantitation it is possible to average together scans over a period around the time a peak was selected for MSMS, which will give better ion statistics and accuracy. It is necessary to specify an approximate resolution of the data for the purposes of peak detection. For isotopic labeling strategies, if it is known that there was not full incorporation of the isotope into the heavy reagent, this can be compensated for.
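
The protein-level statistics listed above can be reproduced from a list of peptide ratios with the Python standard library. This is a sketch for checking exported values, not the Search Compare code: the function name, the dictionary keys and the 'inclusive' quartile method are assumptions, and Search Compare may compute quartiles differently.

```python
import statistics


def summarize_ratios(ratios, n_sd=1.0):
    """Summarize the peptide ratios measured for one protein."""
    if len(ratios) < 2:
        raise ValueError("need at least two peptide ratios")
    # Quartiles; q1 and q3 bracket the middle 50% of measurements.
    q1, median, q3 = statistics.quantiles(ratios, n=4, method="inclusive")
    mean = statistics.mean(ratios)
    sd = statistics.stdev(ratios)  # sample standard deviation (assumed)
    return {
        "num": len(ratios),
        "median": median,
        "iqr": (q1, q3),
        "mean": mean,
        "mean_minus_sd": mean - n_sd * sd,
        "mean_plus_sd": mean + n_sd * sd,
    }
```

The n_sd argument plays the role of the 'any number of standard deviations' setting.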

1). If you are using Batch-Tag Web create an archive file (zip, rar, 7z, tgz, etc) containing the centroid files and the raw data files. The exact contents of the file are instrument dependent. If you are doing quantitation using MSMS reporter ions (iTRAQ, TMT, etc) it is not strictly necessary to upload the raw data files. However you should make sure that any deisotoping done by the peak list generation software doesn't affect the reporter ion intensities. The name of the archive file will be the project name by which you access the data in the future using Search Compare. All project names for a given user need to be unique. If you have an in-house version of Protein Prospector with a data repository you can use the Batch-Tag Make Project... option instead of Batch-Tag Web.

2). Select the archive file using the Batch-Tag Web Browse... button and set the search parameters. The labelled amino acids are generally chosen as constant modifications for quantitation methods using MSMS reporter ions or variable modifications otherwise. Press the Start Search button to do the search. The data file may take quite a long time to upload depending on your internet connection and the size of the file.

3). Once the search is finished you need to access the results via the Search Compare form. If you kept your browser open during the search this will appear automatically. If you receive emails once searches have finished you can access it by clicking on the link in the email. Otherwise select Search Compare from the Prospector home page, login, select the project and then the results name (as entered in the Results Name item on the Batch-Tag/Batch-Tag Web form).

4). On the Search Compare form:

a). Set Report Type to Peptide.

b). Press the + next to Raw Data/Quantitation to open this section.

c). Set the Raw Type to Quantitation.

d). Select the appropriate quantitation method from the Quantitation menu.

e). Tick the L/H Int box for peak intensities or the L/H Area box for peak areas.

f). Set the Resolution appropriately. The value isn't critical but has to be a reasonable starting point. If the resolution is 50000 it is not sufficient to leave it at 10000.

g). Search Compare has an option to measure the CS (Cosine Similarity) of the matched isotope profile to the theoretical isotope profile. The first 3 peaks of the isotope profile are used. A perfect match gives a cosine similarity of 1.0. You can limit reported quantitations to matches where the cosine similarity exceeds a specified threshold by entering a value into the CS Limit field.

h). Press Compare Searches.

In the resulting report you can click on the masses in the m/z column to get a detailed report of the quantitation calculation for a given peptide.

5). If you set the Report Type to Protein and tick the Median box this will give a single ratio for each protein. The IQR tick box will give the interquartile range in 2 columns.
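
The cosine similarity mentioned in step 4 g) is the standard cosine of the angle between two intensity vectors. The sketch below shows the calculation on the first 3 isotope peaks; the function name and the plain list inputs are assumptions, and how Search Compare extracts the observed profile from the raw data is not covered.

```python
import math


def cosine_similarity(observed, theoretical, n_peaks=3):
    """Cosine similarity of the first n_peaks isotope peak intensities."""
    a = observed[:n_peaks]
    b = theoretical[:n_peaks]
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    # An all-zero profile has no direction; report 0.0 in that case.
    return dot / norm if norm else 0.0
```

Because only the shape of the profile matters, an observed profile that is a scaled copy of the theoretical one scores 1.0 regardless of absolute intensity.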

If you tick the Save Settings box then the current form settings will be saved to a cookie when you press the Compare Searches button. If you do this the report will not be generated, so you will have to return to the Search Compare form and uncheck Save Settings to run the program.

There is a browser specific maximum cookie length which is generally 4096 characters. If you exceed this limit an error message will be generated and the cookie will not be saved.

You need to be careful when saving parameters which are generally hidden as these will be applied in the future.

The cookie which is saved is called search_compare_params and you can delete it using your browser's cookie management facility. If you use multiple different Protein Prospector sites each one will have a different cookie.

At the top of each search result there will be a histogram plot of discriminant scores. Within this plot there should be two distributions, one for the correct answers and one for the incorrect answers. The distribution for the incorrect answers will always be much larger than the correct because for each spectrum Batch Tag is saving the top five results, so even if a correct answer is determined for every spectrum the incorrect distribution will be four times bigger.

After the discriminant score histogram there is a button 'Batch Tag of Listed Accession Numbers'. This allows you to do a subsequent database search of only the proteins that were listed in this report. Clicking on this will launch a new Batch Tag page where the accession numbers of the proteins identified in the first search are input into a new search page. This is a good approach for looking for post-translational modifications or peptides formed by enzyme non-specific cleavages of proteins that you have already decided are present.

If a search was performed against a concatenated database, it will report the number of proteins identified to the target database and separately the number matched to decoy sequences. Matches to decoy proteins are displayed with a negative accession number.

Clicking on a number in the Rank column will give a peptide report for the protein selected, including a plot of the sequence coverage observed. Clicking on the protein accession number will link to the relevant protein database entry.

In the peptide report, after the histogram of discriminant scores, if there are more than 50 peptide identifications then there is a second histogram that plots the mass errors for all peptide matches above the minimum discriminant score. It also reports the mean error and standard deviations. (Note: for a TOFTOF Multi Sample Report this will be the mean across all spots). The mean error can be used as the 'systematic error' when re-searching data to improve the mass accuracy.
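
If you export the peptide report, the mean error and standard deviation can be recomputed from the precursor mass error column. The following is a minimal sketch; the function name is hypothetical and the use of the sample standard deviation is an assumption. The errors are in whatever unit was used for the search, e.g. ppm.

```python
import statistics


def mass_error_summary(errors):
    """Return (mean, standard deviation) of precursor mass errors."""
    mean = statistics.mean(errors)
    sd = statistics.stdev(errors) if len(errors) > 1 else 0.0
    return mean, sd
```

The returned mean is the value you would enter as the systematic error when re-searching the data.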

Clicking on the hit number will link to a peptide report including a sequence coverage map. Clicking on the peptide sequence will link to an MS-Product type report, which will display the match to the fragment ions. This report will give a visual representation of the peak matches: red peaks were matched, black peaks were not. It is possible to zoom in and out in this panel. Only ion types that are appropriate to the instrument type will be used. As a default only a, b, c, y and z ions are labeled, but clicking on 'loss' will label water and ammonia loss ions, clicking on 'imm' will label immonium ions, and clicking on 'Int' will label internal ions. Clicking on 'mass' will label the masses of those peaks not matched. Underneath this is a table listing all the peaks submitted and their corresponding matches with errors. Next will be a list of theoretical fragment ions from the peptide, where fragment ions that were observed are in red. Finally, there is a table that lists all potential fragment ions in order of increasing mass. If a mass modification has been assigned to the peptide, then clicking on the mass in the sequence at the top of the page will link to Unimod, where it will display known modifications with nominally the same mass.

Clicking on a value in the RT column in the Peptide Report will open a link to MS-Tag From File that allows you to search this one spectrum.

For quantitative studies, a boxplot is displayed for each protein, visualizing the quantitation results. A mean and the interquartile range are displayed. Datapoints from peptides that are unique to the protein reported are in red, while peptides that are shared among database entries are in blue. In the peptide report, for ratios where one of the datapoints is below the threshold, the ratio will be reported with a greater than or less than sign to indicate the lack of accuracy of the measurement. If a particular peak is not present at all the ratio will be reported as ‘high’ or ‘low’. In quantitation reports, when clicking on the precursor m/z this will link to the raw data that was used for the quantitation measurement.


This is a version of MS-Tag that can search a single spectrum that is part of a set of peak lists in a file. This allows the user to see the other matches to a particular spectrum based solely on the Batch-Tag scoring system. Also, as only a single file is being searched it is possible to use looser search parameters (e.g. allow for multiple modifications, more missed cleavages, wider mass tolerance) that could be prohibitively slow when searching a whole dataset.

When a peak list is submitted to Batch-Tag if there are more than a certain number of peaks in the list (the default threshold is 40), then Prospector splits the mass range into two and takes the 20 most intense peaks in each half of the spectrum as the peak list for searching. In MS-Tag from File you can change this number. This might be useful, for example, if you think a peptide may be phosphorylated and you want to look for low intensity phosphorylated fragment ions.


Batch-Tag search results are stored in a database, so previous search results can be viewed and re-interpreted at a later date. This also allows comparison of search results against each other. Results are stored by user, so you will only be able to see your own results. Searches of one set of data are grouped together into projects, then within a project the data can be searched with different parameters to create multiple result sets from one dataset.

Using the program 'Results Management' it is possible to:

1). delete unwanted results files or whole projects;

2). export one or more projects into an archive file for later import by another user on the same server or a different server;

3). import a project archive file created by a previous export operation;

4). compress one or more projects so they take less disk space;

5). uncompress one or more previously compressed projects;

6). check that all the files required by one or more projects are present.

There is a date filter which controls which projects are displayed. You can either show all the projects or filter by a project creation or project access date range. The project access date is the last time the project was selected by either Batch-Tag or Search Compare. Note the access date feature only works after version 5.10.4.

When exporting projects there is a Data To Export option. You should only select the No Data option if you are moving the project between users on the same server and the data is stored in a data repository (ie the data used to create the project was not uploaded using Batch-Tag Web). The Peaklists Only option will omit any raw data files. This may be necessary to comply with the 2 GByte limit mentioned below.

As projects have to be imported via a file upload operation there is a 2 GByte limit to the size of an archive file.

When importing a project it is possible to change the project name. This will be necessary if the server already has a project with the same name as the original project.

Projects containing searches from newer versions of Prospector cannot be imported. Also parameters used in the project (eg amino acid modifications) must match between the Prospector instances.

When a project is compressed all related files in the user repository are compressed to 7z format. Data in the centroid and raw data repositories is not compressed. If you use the Firefox browser an icon will indicate whether or not a project is compressed.

Large files can sometimes take a few minutes to compress or uncompress. You should let the process complete. If for some reason the process does not complete then you will need to redo the compression or uncompression process before the project can be used.

If you select a compressed project for Search Compare or Batch-Tag it will be automatically uncompressed as part of the selection process.

If a compressed project is exported then it is uncompressed before the export file is generated. After that it is compressed again.

Compressed projects can be deleted.

If you choose to check one or more projects Results Management will report if any of the required files are missing from the repository. After the results have been reported you can either decide to restore the files or delete the project. If any of the projects are compressed then some of the files will need to be temporarily uncompressed during the checking process. It is recommended that you only check projects when no searches are running.

When you export a project it creates a zip file. The zip file contains 3 folders called data, project and results. There are also some xml files containing data for mySQL database entries. It is potentially possible to assemble one of these zip files manually if for example you lost your mySQL database and you wanted to reactivate your results from the files in the repository. The zip file could then be imported to another Protein Prospector installation. Another reason for doing this would be to import a search from a command line version of Protein Prospector. If you create such a zip file it must have the same name as the project.

The data directory contains the data files used by the project. These are the peak list files and optionally the raw data files. If the data files are already in the lab repository this folder isn't necessary. These data files are referenced by the project file (see next paragraph).

The project directory contains the project file, any associated expectation search files and any associated msfilter files. For example the project file for the project F1 would be called F1.xml and the expectation search files would be F1.exp.1.xml, F1.exp.2.xml, F1.exp.3.xml, etc. The msfilter files will have names such as EZAcinyDHvPHZ4nE.idx and will be in the same repository folder as the project file.

The results directory contains the results files associated with the project and associated discriminant score cache files. The results files have names such as 4tQcoggUES7bz9PU.xml. The 16 character random string is the original search key used by Batch Tag. An associated discriminant score cache file for this search key would be called 4tQcoggUES7bz9PU.disc.txt.
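
Before importing a hand-assembled zip file it can be worth checking its layout. The sketch below checks for the project and results folders and that the archive name matches the project name. It assumes the three folders sit at the top level of the archive, which is not stated above, and the function names are hypothetical. The data folder is not checked because it is optional when the data files are already in the lab repository.

```python
import zipfile
from pathlib import Path


def top_level_entries(zip_path):
    """Names of the top-level folders and files in the archive."""
    with zipfile.ZipFile(zip_path) as zf:
        return {name.split("/", 1)[0] for name in zf.namelist() if name}


def missing_folders(zip_path):
    """List which of the project and results folders are absent."""
    present = top_level_entries(zip_path)
    return [folder for folder in ("project", "results") if folder not in present]


def name_matches_project(zip_path, project_name):
    """The zip file must have the same name as the project."""
    return Path(zip_path).stem == project_name
```

An empty list from missing_folders only means the folders exist; it does not validate the files inside them.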

One difficulty is associating the results files in your repository with the correct project. A way to do this is to look at the parameter block at the top of the results file. The parameter expect_coeff_file gives the name of the expectation search file which has the project name in it. This will not work if you didn't do an expectation value search. You could use the Linux grep command to pull this line out of all the results files in the directory structure under the repository folder for your user.
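
Where grep is not available, the same lookup can be sketched in Python. The function below walks a directory tree and records the first line mentioning expect_coeff_file in each .xml file. The function name is hypothetical, and no assumption is made about the exact layout of the parameter block: the line is returned as found.

```python
from pathlib import Path


def find_expect_coeff_lines(repo_dir):
    """Map each results file name to its expect_coeff_file line."""
    hits = {}
    for path in sorted(Path(repo_dir).rglob("*.xml")):
        with path.open(encoding="utf-8", errors="replace") as handle:
            for line in handle:
                if "expect_coeff_file" in line:
                    hits[path.name] = line.strip()
                    break  # the parameter block is at the top of the file
    return hits
```

Files with no such line, for example from searches without expectation value calculation, are simply left out of the result.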

Finally here are some examples of the xml files mentioned in the first paragraph of this section. project.xml contains data for an entry from the mySQL projects table. The only lines used from this file during the import process are instrument, calibration_index and record_created. If a project is imported then you can't already have a project with the same name.

<?xml version="1.0" encoding="UTF-8"?>
<?Wed Apr 06 10:02:48 2022, ProteinProspector Version 6.3.23?>
<project>
	<record>
		<project_id>98</project_id>
		<pp_user_id>1</pp_user_id>
		<project_name>F1</project_name>
		<project_file>F1.xml</project_file>
		<project_path>x/b/xbxhwdgdd2/batchtag/project/2017_02</project_path>
		<instrument>ESI-Q-TOF</instrument>
		<calibration_index>0</calibration_index>
		<record_created>2017-02-16 12:17:23</record_created>
		<record_updated>2022-01-04 14:31:25</record_updated>
	</record>
</project>

results.xml contains data for entries from the mySQL search_jobs table. There is a record for each results file. The lines used from the file are search_job_key, search_stage, search_number, num_serial, results_name, priority, job_status, job_signal, node_name, percent_complete, search_submitted, search_started, search_finished and record_created.

<?xml version="1.0" encoding="UTF-8"?>
<?Wed Apr 06 10:02:48 2022, ProteinProspector Version 6.3.23?>
<search_jobs>
	<record>
		<search_job_id>283</search_job_id>
		<search_job_key>W6CgIZyksUG2BV9Q</search_job_key>
		<search_program>batchtag</search_program>
		<search_stage>1</search_stage>
		<search_number>1</search_number>
		<num_serial>1</num_serial>
		<project_id>98</project_id>
		<results_name>results1</results_name>
		<results_file>W6CgIZyksUG2BV9Q.xml</results_file>
		<results_path>x/b/xbxhwdgdd2/batchtag/results/2017_02</results_path>
		<priority>1</priority>
		<job_status>4</job_status>
		<job_signal>0</job_signal>
		<job_segment>NULL</job_segment>
		<node_name>_Dell3994V4J</node_name>
		<node_pid>4440</node_pid>
		<percent_complete>99</percent_complete>
		<search_submitted>2017-02-16 12:17:23</search_submitted>
		<search_started>2017-02-16 12:17:25</search_started>
		<search_finished>2017-02-16 12:19:27</search_finished>
		<record_created>2017-02-16 12:17:23</record_created>
		<record_updated>2017-02-16 12:19:27</record_updated>
	</record>
	<record>
		<search_job_id>284</search_job_id>
		<search_job_key>XcyIw3WR01eoHAbT</search_job_key>
		<search_program>batchtag</search_program>
		<search_stage>2</search_stage>
		<search_number>1</search_number>
		<num_serial>1</num_serial>
		<project_id>98</project_id>
		<results_name>res2</results_name>
		<results_file>XcyIw3WR01eoHAbT.xml</results_file>
		<results_path>x/b/xbxhwdgdd2/batchtag/results/2017_03</results_path>
		<priority>1</priority>
		<job_status>4</job_status>
		<job_signal>0</job_signal>
		<job_segment>NULL</job_segment>
		<node_name>_Dell3994V4J</node_name>
		<node_pid>5512</node_pid>
		<percent_complete>98</percent_complete>
		<search_submitted>2017-03-13 09:25:26</search_submitted>
		<search_started>2017-03-13 09:25:30</search_started>
		<search_finished>2017-03-13 09:26:58</search_finished>
		<record_created>2017-03-13 09:25:26</record_created>
		<record_updated>2017-03-13 09:26:58</record_updated>
	</record>
</search_jobs>
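
When assembling or checking a results.xml file by hand, it can help to read the records back. The sketch below uses the Python standard library to load each record into a dictionary and to check that results_file is the search key followed by .xml, as in the example above. The function names are hypothetical and SAMPLE is an abbreviated copy of the first record.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<?xml version="1.0" encoding="UTF-8"?>
<search_jobs>
	<record>
		<search_job_key>W6CgIZyksUG2BV9Q</search_job_key>
		<results_name>results1</results_name>
		<results_file>W6CgIZyksUG2BV9Q.xml</results_file>
	</record>
</search_jobs>"""


def read_records(xml_text):
    """Return one {tag: text} dictionary per record element."""
    root = ET.fromstring(xml_text)
    return [{child.tag: child.text for child in record}
            for record in root.findall("record")]


def key_matches_file(record):
    """True if results_file is the search key plus the .xml extension."""
    return record["results_file"] == record["search_job_key"] + ".xml"
```

The same read_records function works for project.xml and msfilter.xml, since they also hold a list of record elements under a single root.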

msfilter.xml contains data for entries from the mySQL msfilter table. There is a record for each msfilter file. The lines used from the file are file, name, and record_created. The example just shows a single record but there can be multiple records.

<?xml version="1.0" encoding="UTF-8"?>
<?Wed Apr 06 12:07:19 2022, ProteinProspector Version 6.3.23?>
<msfilter>
	<record>
		<msfilter_id>49</msfilter_id>
		<project_id>98</project_id>
		<file>ucgOZiBpp1REPVpx.idx</file>
		<name>z3</name>
		<record_created>2022-04-06 12:07:04</record_created>
		<record_updated>2022-04-06 12:07:04</record_updated>
	</record>
</msfilter>

If you want to import files created by running Protein Prospector from the command line you will probably need to create the project file yourself. An example file is shown below. There is a file block for each peak list file. The raw data files are optional. This is an example of a project where the data files are in the user repository. This user's files are stored under the x/b/xbxhwdgdd2 directory in the user repository. The x and b directories match the first 2 characters of xbxhwdgdd2. The data files for this particular project are in the x/b/xbxhwdgdd2/batchtag/data/2021_09/6q6lJ53IqnBfl7UF directory. The # character at the start of the file path indicates user repository as distinct from data repository. The centroid_name line denotes what is displayed in the fraction column in Search Compare. The file also needs a line indicating the number of spectra in each peak list file. This is required as Batch-Tag splits a job into batches of typically 500 spectra.

<?xml version="1.0" encoding="UTF-8"?>
<?Sun Sep 12 10:23:14 2021, ProteinProspector Version 6.3.23?>
<project>
<project_name>pGlyco_HumanMilk</project_name>
<file>
<centroid>#x/b/xbxhwdgdd2/batchtag/data/2021_09/6q6lJ53IqnBfl7UF/20190307_L1_Ag6_Zhu00015_SA_TestGly_2000_01_FTMSms2hcd.txt</centroid>
<num_msms_spectra>19072</num_msms_spectra>
<centroid_name>20190307_L1_Ag6_Zhu00015_SA_TestGly_2000_01_FTMSms2hcd</centroid_name>
<raw>#x/b/xbxhwdgdd2/batchtag/data/2021_09/6q6lJ53IqnBfl7UF/20190307_L1_Ag6_Zhu00015_SA_TestGly_2000_01.raw</raw>
</file>
<file>
<centroid>#x/b/xbxhwdgdd2/batchtag/data/2021_09/6q6lJ53IqnBfl7UF/20190307_L1_Ag6_Zhu00015_SA_TestGly_2000_02_FTMSms2hcd.txt</centroid>
<num_msms_spectra>19373</num_msms_spectra>
<centroid_name>20190307_L1_Ag6_Zhu00015_SA_TestGly_2000_02_FTMSms2hcd</centroid_name>
<raw>#x/b/xbxhwdgdd2/batchtag/data/2021_09/6q6lJ53IqnBfl7UF/20190307_L1_Ag6_Zhu00015_SA_TestGly_2000_02.raw</raw>
</file>
<file>
<centroid>#x/b/xbxhwdgdd2/batchtag/data/2021_09/6q6lJ53IqnBfl7UF/20190307_L1_Ag6_Zhu00015_SA_TestGly_4000_01_FTMSms2hcd.txt</centroid>
<num_msms_spectra>18321</num_msms_spectra>
<centroid_name>20190307_L1_Ag6_Zhu00015_SA_TestGly_4000_01_FTMSms2hcd</centroid_name>
<raw>#x/b/xbxhwdgdd2/batchtag/data/2021_09/6q6lJ53IqnBfl7UF/20190307_L1_Ag6_Zhu00015_SA_TestGly_4000_01.raw</raw>
</file>
<file>
<centroid>#x/b/xbxhwdgdd2/batchtag/data/2021_09/6q6lJ53IqnBfl7UF/20190307_L1_Ag6_Zhu00015_SA_TestGly_4000_02_FTMSms2hcd.txt</centroid>
<num_msms_spectra>18457</num_msms_spectra>
<centroid_name>20190307_L1_Ag6_Zhu00015_SA_TestGly_4000_02_FTMSms2hcd</centroid_name>
<raw>#x/b/xbxhwdgdd2/batchtag/data/2021_09/6q6lJ53IqnBfl7UF/20190307_L1_Ag6_Zhu00015_SA_TestGly_4000_02.raw</raw>
</file>
</project>
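
The project file layout above could be generated with a short script. The following is only a sketch, assuming Python is available; the function names build_project_xml and num_batches are hypothetical and not part of Protein Prospector. It omits the second header line carrying the date and version, and whether the importer needs that line is not stated here.

```python
import math
from xml.sax.saxutils import escape


def build_project_xml(project_name, files):
    """Return project file text.

    files is a list of (centroid_path, num_msms_spectra, raw_path) tuples;
    raw_path may be None because the raw data files are optional.
    """
    lines = ['<?xml version="1.0" encoding="UTF-8"?>', "<project>"]
    lines.append("<project_name>%s</project_name>" % escape(project_name))
    for centroid_path, num_spectra, raw_path in files:
        # Name shown in the fraction column: file name without directory
        # or extension, as in the example project file.
        name = centroid_path.rsplit("/", 1)[-1].rsplit(".", 1)[0]
        lines.append("<file>")
        lines.append("<centroid>%s</centroid>" % escape(centroid_path))
        lines.append("<num_msms_spectra>%d</num_msms_spectra>" % num_spectra)
        lines.append("<centroid_name>%s</centroid_name>" % escape(name))
        if raw_path is not None:
            lines.append("<raw>%s</raw>" % escape(raw_path))
        lines.append("</file>")
    lines.append("</project>")
    return "\n".join(lines) + "\n"


def num_batches(num_spectra, batch_size=500):
    """Number of batches for one peak list, assuming the typical size of 500."""
    return math.ceil(num_spectra / batch_size)
```

With the typical batch size of 500, a peak list of 19072 spectra would be split into 39 batches.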

A project file where the peak list files are in a data repository looks a little different. For example:

<?xml version="1.0" encoding="UTF-8"?>
<?Sun Sep 12 10:23:14 2021, ProteinProspector Version 6.3.23?>
<project>
<project_name>pGlyco_HumanMilk</project_name>
<file>
<centroid>$LumosHCD-high-res/Outside/pGlyco3/20190307_L1_Ag6_Zhu00015_SA_TestGly_2000_01_FTMSms2hcd.txt</centroid>
<num_msms_spectra>19072</num_msms_spectra>
<centroid_name>20190307_L1_Ag6_Zhu00015_SA_TestGly_2000_01_FTMSms2hcd</centroid_name>
<raw>$Lumos/Outside/pGlyco3/20190307_L1_Ag6_Zhu00015_SA_TestGly_2000_01.raw</raw>
</file>
<file>
<centroid>$LumosHCD-high-res/Outside/pGlyco3/20190307_L1_Ag6_Zhu00015_SA_TestGly_2000_02_FTMSms2hcd.txt</centroid>
<num_msms_spectra>19373</num_msms_spectra>
<centroid_name>20190307_L1_Ag6_Zhu00015_SA_TestGly_2000_02_FTMSms2hcd</centroid_name>
<raw>$Lumos/Outside/pGlyco3/20190307_L1_Ag6_Zhu00015_SA_TestGly_2000_02.raw</raw>
</file>
<file>
<centroid>$LumosHCD-high-res/Outside/pGlyco3/20190307_L1_Ag6_Zhu00015_SA_TestGly_4000_01_FTMSms2hcd.txt</centroid>
<num_msms_spectra>18321</num_msms_spectra>
<centroid_name>20190307_L1_Ag6_Zhu00015_SA_TestGly_4000_01_FTMSms2hcd</centroid_name>
<raw>$Lumos/Outside/pGlyco3/20190307_L1_Ag6_Zhu00015_SA_TestGly_4000_01.raw</raw>
</file>
<file>
<centroid>$LumosHCD-high-res/Outside/pGlyco3/20190307_L1_Ag6_Zhu00015_SA_TestGly_4000_02_FTMSms2hcd.txt</centroid>
<num_msms_spectra>18457</num_msms_spectra>
<centroid_name>20190307_L1_Ag6_Zhu00015_SA_TestGly_4000_02_FTMSms2hcd</centroid_name>
<raw>$Lumos/Outside/pGlyco3/20190307_L1_Ag6_Zhu00015_SA_TestGly_4000_02.raw</raw>
</file>
</project>

Note that the file paths now start with a $ character and the directory structures start with a particular instrument. There is a separate directory tree for peak list data and raw data.
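
The two path prefixes can be told apart mechanically. A minimal sketch in Python, assuming only the conventions described above (# for the user repository, $ for a data repository, and a user directory nested under its first two characters); the function names are hypothetical.

```python
def classify_repository_path(path):
    """Split a project file path into (repository kind, relative path)."""
    if path.startswith("#"):
        return ("user", path[1:])
    if path.startswith("$"):
        return ("data", path[1:])
    return ("unknown", path)


def user_repository_dir(user_key):
    """Directory of a user inside the user repository, e.g. x/b/xbxhwdgdd2."""
    return "%s/%s/%s" % (user_key[0], user_key[1], user_key)
```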

The last issue to deal with is the difference between results files from a web installation and those from a command line installation. The results file first needs to be renamed so that its filename is a random 16 character (lower and upper case letters and numbers) name such as 6q6lJ53IqnBfl7UF.xml. The parameter block at the top of the file also needs to be edited. The parameters output_file, output_filepath, project_filename and project_filepath should be removed and a search_key parameter added. If the file name used was 6q6lJ53IqnBfl7UF.xml then the search key would be 6q6lJ53IqnBfl7UF. Another thing to watch for is that you can only import a project into a later version of Protein Prospector.
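
The renaming and parameter edits could be scripted. The sketch below is in Python and makes some assumptions: each parameter sits on its own line as in the example results file, and the search_key element can be appended at the end of the parameter block (whether its position matters is not stated). The function names are hypothetical.

```python
import re
import secrets
import string

# Parameters that have to be removed from a command line results file.
REMOVED_PARAMETERS = ("output_file", "output_filepath",
                      "project_filename", "project_filepath")


def new_search_key(length=16):
    """Random key built from lower and upper case letters and numbers."""
    alphabet = string.ascii_letters + string.digits
    return "".join(secrets.choice(alphabet) for _ in range(length))


def convert_parameters(xml_text, search_key):
    """Drop the four command line parameters and add a search_key element."""
    for name in REMOVED_PARAMETERS:
        pattern = r"[ \t]*<%s>.*?</%s>[ \t]*\r?\n?" % (re.escape(name), re.escape(name))
        xml_text = re.sub(pattern, "", xml_text)
    key_line = "<search_key>%s</search_key>\n</parameters>" % search_key
    # Only the first closing tag is touched; the rest of the file is kept verbatim.
    return xml_text.replace("</parameters>", key_line, 1)
```

The converted text would then be saved under the name formed from the key, for example 6q6lJ53IqnBfl7UF.xml for the key 6q6lJ53IqnBfl7UF.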


The Search Table displays the number of searches currently taking place and the stage/progress of each of these searches. For their own searches, users will also see the search names and a link to the progress of each individual search.


The instructions here relate to Analyst QS 2.0 in conjunction with version 1.6b25 of the Mascot.dll processing script. The latest version of the Mascot.dll processing script can be downloaded from the Matrix Science website and may differ slightly. Unzip the archive, run the installer, then take the Mascot.dll file and place it in a directory which is called either PE Sciex Data\Projects\API Instrument\Processing Scripts or Analyst Data\Projects\API Instrument\Processing Scripts. If you have both these directories then use Analyst Data\Projects\API Instrument\Processing Scripts. If the file is already there then overwrite it.

Open your wiff file using Analyst and select the menu item Explore->Show->Show TIC.

Now select the menu item Script->Mascot to display a popup window.

Click the Options... button.

We suggest changing the Precursor mass tolerance for grouping value to 0.2, unchecking the De-isotope MS/MS data tick box and setting the Default precursor charge states to +2, +3 and +4. You might also want to consider +5 peaks depending on your application.

At this point you have a couple of options dependent on the browser you want to use. Protein Prospector probably works somewhat better with Mozilla Firefox than Internet Explorer in that some reports will display faster.

Analyst can automatically launch the Protein Prospector Batch-Tag Web form in the Internet Explorer browser if you select the Protein Prospector option in the Analyst popup. For the public web site the url to enter in the Protein Prospector box is http://prospector2.ucsf.edu/prospector/cgi-bin/msform.cgi?form=batchtagweb. For this to work you will also need to have an Internet Explorer browser open and to have logged in to one of the programs in the Batch MSMS Database searching section of the Protein Prospector home page.

Click the OK button and then the Search button. If you chose the Internet Explorer option the Batch-Tag Web form should eventually be launched with the peak list file name in the Upload Data From File field.

This approach is somewhat limited for the following reasons:

1). It only works with the Internet Explorer browser.

2). Protein Prospector assigns project names for the data sets that you upload based on the name of the uploaded file. Analyst assigns names such as mas31 which are uninformative.

3). It only allows you to search one peak list at a time and doesn't allow you to upload the wiff file with the peak list file.

If you don't log in to Protein Prospector in Internet Explorer before pressing the Search button then Analyst will launch an Internet Explorer window with a Protein Prospector Login page and display a popup window saying it could not navigate to the web site. However, the peak list file will still be created. A typical place where the file is stored is Documents and Settings\<user name>\Local Settings\Temp or one of its subdirectories. The file will have a name like mas31.tmp. You could try launching a search window and looking for a file called mas*.tmp in Documents and Settings.

Once you have located the peak list files you can browse for them using the Browse... button on the Batch-Tag Web form. Alternatively you can copy them to a more convenient location and give them more meaningful names. Because they are initially stored in a temporary directory, they will probably be deleted at some stage.

If you want to search multiple peak lists at the same time or upload wiff files with the peak list files then follow the instructions in the Upload Data From File section.

Details of the extractms program (also called lcq_dta) can be found on the sequest home page. If the raw data file is called F6032010.RAW the program will produce a series of dta files (one per MSMS scan) with names such as F6032010.1093.1095.2.dta. Here F6032010 is the fraction number, 1093 is the start scan number, 1095 is the end scan number and 2 is the precursor charge.
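
The naming scheme can be decoded mechanically. A minimal Python sketch, assuming the four dot-separated parts described above; the function name parse_dta_name is hypothetical.

```python
def parse_dta_name(filename):
    """Decode a name such as F6032010.1093.1095.2.dta into its parts."""
    stem, extension = filename.rsplit(".", 1)
    if extension.lower() != "dta":
        raise ValueError("not a dta file: %r" % filename)
    # Split from the right so a fraction name containing dots stays intact.
    fraction, start_scan, end_scan, charge = stem.rsplit(".", 3)
    return {
        "fraction": fraction,
        "start_scan": int(start_scan),
        "end_scan": int(end_scan),
        "charge": int(charge),
    }
```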

To search these dta files with Batch-Tag Web put them together in a zip file and select the zip file with the Batch-Tag Web Browse button. Protein Prospector will assign a project name for the search based on the name of the zip file.

You can include the dta files from multiple raw data files in the zip file. Protein Prospector will then indicate the fraction number as a column in the results.

If you require access to the raw data you can also include the raw data files in the zip file.
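
The zip file can be made with any archive tool. As one possible approach, the Python sketch below collects the dta files (and raw files, if present) from a single directory into a zip file. Storing the files at the top level of the archive is an assumption of this sketch, and the function name zip_dta_files is hypothetical.

```python
import zipfile
from pathlib import Path


def zip_dta_files(directory, zip_path, extensions=(".dta", ".raw")):
    """Add matching files from directory to zip_path; return the file count."""
    directory = Path(directory)
    count = 0
    with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        for path in sorted(directory.iterdir()):
            # Extension comparison is case-insensitive so .RAW files are included.
            if path.is_file() and path.suffix.lower() in extensions:
                archive.write(path, arcname=path.name)
                count += 1
    return count
```

Since the project name is taken from the zip file name, it is worth choosing an informative name for zip_path.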

ReAdW can be used to convert Thermo RAW files to either mzXML or mzML files. Protein Prospector can search the MSMS peaklists in mzXML files. Typically you might run the program as follows:

ReAdW --mzXML -c data.RAW

This will produce a file called data.mzXML.

The -c option is used for centroiding the data. This is only necessary if you have acquired your data in profile mode. A -z flag can be used to compress the spectral data in the file. However the compression achieved by this is fairly marginal and it is best to compress the mzXML file itself before uploading it. Using the -g flag will produce a compressed file called data.mzXML.gz. If you are uploading multiple files then you will need to compress the archive rather than the individual files.

Mascot Distiller and PAVA produce mgf files with TITLE lines in the following format:

TITLE=Scan 5 (rt=0.0849267) [\\Ltqft\RAW\FT_June_2006\F6060902.RAW]
TITLE=Sum of 2 scans in range 36 (rt=0.676758) to 49 (rt=0.938248) [\\Ltqft\RAW\FT_June_2006\F6060902.RAW]
TITLE=1: Scan 195 (rt=5.6608) [d:\cygwin\home\F7010302.RAW]

Such mgf files can be uploaded along with RAW files and Protein Prospector will be able to locate the relevant raw data and perform quantitative analysis.
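
To illustrate what information these TITLE lines carry, the Python sketch below pulls out the scan numbers, retention times and raw file path from the three formats shown. It is based only on those three examples and is not how Protein Prospector itself reads the lines; the function name is hypothetical.

```python
import re

# A scan number followed by its retention time, e.g. "36 (rt=0.676758)".
_SCAN_RE = re.compile(r"(\d+) \(rt=([0-9.]+)\)")
# The raw file path in square brackets at the end of the line.
_RAW_RE = re.compile(r"\[(.+)\]\s*$")


def parse_distiller_title(line):
    """Return the scans [(number, rt), ...] and raw file named in a TITLE line."""
    text = line.strip()
    if text.startswith("TITLE="):
        text = text[len("TITLE="):]
    scans = [(int(scan), float(rt)) for scan, rt in _SCAN_RE.findall(text)]
    raw_match = _RAW_RE.search(text)
    return {"scans": scans, "raw_file": raw_match.group(1) if raw_match else None}
```

For a summed spectrum the scans list holds the first and last scan of the range.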

1. Lynn, A. J., Chalkley R. J., Baker P. R., Medzihradszky K. F., Guan S. and Burlingame A. L., The Effect of Peaklist Generation Software on Database Search Results, 56th ASMS Conference of Mass Spectrometry and Allied Topics, Denver, Colorado June 1st - June 5th 2008

The ABI 4000 series Explorer software has a function called Peaks To Mascot that will produce a peak list in mgf format. The resultant file has a TITLE line of the following format:

TITLE=Label: 59, Spot_Id: 41307, Peak_List_Id: 3210, MSMS Job_Run_Id: 14369, Comment:

Protein Prospector extracts the number in the Label field and the number in the MSMS Job Run ID field. The number in the Label field (59 in this case) will be displayed in the RT column of the Search Compare results page. The number in the MSMS Job Run ID field will be displayed in the R field. There is also a # field in the results which is a spectrum number. Each spectrum for a given label will have a different spectrum number.
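
The two numbers can be extracted from such a TITLE line with a pair of regular expressions. This is a Python sketch based on the single example line above, with a hypothetical function name.

```python
import re


def parse_4000_title(line):
    """Return the Label and MSMS Job_Run_Id numbers from a TITLE line."""
    label = re.search(r"Label:\s*(\d+)", line)
    job_run = re.search(r"MSMS Job_Run_Id:\s*(\d+)", line)
    if label is None or job_run is None:
        raise ValueError("unexpected TITLE line: %r" % line)
    return {"label": int(label.group(1)), "msms_job_run_id": int(job_run.group(1))}
```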

Protein Prospector includes a program called Peak Spotter which can either be run from the command line or via a web page interface. Users who don't have a local copy of Protein Prospector can use the command line version of Peak Spotter to extract peak lists for submission by Batch-Tag Web from the TOFTOF Oracle database. It is also possible to extract T2D files to enable quantitative analysis. Details on running Peak Spotter can be found in the Protein Prospector Automation Manual.

To run Peak Spotter you also need to download and install the Oracle Client Software. The file tnsnames.ora needs to be configured so that the client can connect to the Oracle database. Information on configuring the file can be found on the Oracle FAQ web site.

When running Peak Spotter you select one or more spot sets and can also elect to extract the MS and/or the MSMS raw data along with the centroid data. If you are extracting data for uploading via Batch-Tag Web rather than for creating a central data repository the Peak Spotter parameters centroid_dir and raw_dir need to be set to the same value, say D:\TOFTOF.

For example if the spot set is called User Project 1/test then Peak Spotter will create a subdirectory called User Project 1 in D:\TOFTOF. The D:\TOFTOF\User Project 1 directory will contain a file called test.txt which is the centroid file. If you specified raw data extraction an additional subdirectory called test will be created containing the raw data.

A set of files for a typical spot set called test is shown below. To search this data set using Batch-Tag Web you should create an archive file containing these files. The project name by which this data set will subsequently be referenced will be set to the archive file name. The archive file can contain multiple spot sets if desired.

test.txt
test\13_MS_3-1.cal
test\13_MS_3.t2d
test\14_MS_3-1.cal
test\14_MS_3.t2d
test\15_MS_3-1.cal
test\15_MS_3.t2d
test\16_MSMS_540353_4-1.cal
test\16_MSMS_540353_4.t2d
test\16_MS_3-1.cal
test\16_MS_3.t2d
test\17_MSMS_540352_4-1.cal
test\17_MSMS_540352_4.t2d
test\17_MS_3-1.cal
test\17_MS_3.t2d
test\18_MS_3-1.cal
test\18_MS_3.t2d
test\19_MS_3-1.cal
test\19_MS_3.t2d
test\54_MS_1-1.cal
test\54_MS_1.t2d
test\55_MS_1-1.cal
test\55_MS_1.t2d
test\56_MSMS_540344_2-1.cal
test\56_MSMS_540344_2.t2d
test\56_MS_1-1.cal
test\56_MS_1.t2d
test\57_MSMS_540340_2-1.cal
test\57_MSMS_540340_2.t2d
test\57_MSMS_540343_2-1.cal
test\57_MSMS_540343_2.t2d
test\57_MS_1-1.cal
test\57_MS_1.t2d
test\58_MSMS_540341_2-1.cal
test\58_MSMS_540341_2.t2d
test\58_MSMS_540342_2-1.cal
test\58_MSMS_540342_2.t2d
test\58_MS_1-1.cal
test\58_MS_1.t2d
test\64_MS_1-1.cal
test\64_MS_1.t2d
test\65_MS_1-1.cal
test\65_MS_1.t2d
test\66_MS_1-1.cal
test\66_MS_1.t2d
test\67_MS_1-1.cal
test\67_MS_1.t2d

Each raw data scan has a t2d file and optionally a cal (calibration) file. An example MS t2d file name is 57_MS_1.t2d. Here 57 is the spot number and 1 is the run number. An example MS cal file name is 57_MS_1-1.cal. Here 57 is the spot number, the first 1 is the run number and the second 1 is the calibration file index (always 1 at present). An example MSMS t2d file name is 57_MSMS_540340_2.t2d. Here 57 is the spot number, 540340 is the job run item id and 2 is the run number. An example MSMS cal file name is 57_MSMS_540340_2-1.cal. Here 57 is the spot number, 540340 is the job run item id, 2 is the run number and 1 is the calibration file index (always 1 at present).
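
The file naming rules above can be written as two regular expressions. The Python sketch below decodes a t2d or cal file name into its parts; the function name and the keys of the returned dictionary are hypothetical.

```python
import re

_MS_RE = re.compile(
    r"^(?P<spot>\d+)_MS_(?P<run>\d+)(?:-(?P<calibration_index>\d+))?\.(?P<ext>t2d|cal)$")
_MSMS_RE = re.compile(
    r"^(?P<spot>\d+)_MSMS_(?P<job_run_item_id>\d+)_(?P<run>\d+)"
    r"(?:-(?P<calibration_index>\d+))?\.(?P<ext>t2d|cal)$")


def parse_toftof_name(name):
    """Decode names such as 57_MS_1.t2d or 57_MSMS_540340_2-1.cal."""
    # Ignore any leading directory, with either kind of path separator.
    base = name.replace("\\", "/").rsplit("/", 1)[-1]
    for level, pattern in (("MS", _MS_RE), ("MSMS", _MSMS_RE)):
        match = pattern.match(base)
        if match:
            info = {key: int(value) for key, value in match.groupdict().items()
                    if key != "ext" and value is not None}
            info["level"] = level
            info["kind"] = match.group("ext")
            return info
    raise ValueError("unrecognised file name: %r" % name)
```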

The centroid file test.txt contains both the MS and the MSMS scans.

An example MS scan from the file is shown below. It consists of a title line starting with the characters >M1 and containing some parameters. This is followed by multiple lines containing m/z and intensity values for the peaks in the spectrum.

>M1$Spot:57$Run:1$JobRunItem:540334$
818.7110 114.9
838.4725 228.7
844.5468 190.7
860.5470 815.2
864.9903 115.9
874.5552 158.3
878.4852 120.6
888.5871 159.6
904.5724 574.1
932.6037 133.6
948.6083 428.7
976.5941 126.9
981.5049 136.4
992.6191 467.9
1036.6511 258.3
1080.6537 197.0
1094.5872 181.0
1124.6823 201.2
1139.6127 139.1
1194.5724 98.7
1210.6232 198.9
1211.6467 191.1
1279.7203 105.9
1280.6852 128.2
1307.6901 153.3
1313.2075 95.9
1376.7445 109.4
1477.7676 163.1
1554.1926 1017.9
1554.6982 1579.3
1582.2468 97.7
1583.7482 120.4
1605.8782 104.1
1676.9181 122.9
2426.2119 152.2
2683.4561 141.6
2891.5688 170.5
2978.6150 411.4
3020.6753 393.7
3063.6936 413.5
3077.7769 344.5
3091.2310 313.1
3106.6538 929.6
3107.7161 15344.2
3114.7305 975.3
3123.6897 289.5
3141.7732 264.6
3163.8062 727.2

An example MSMS scan from the centroid file is shown below. It consists of a title line starting with the characters >M2 and containing some parameters. This is followed by a line containing the precursor m/z and multiple lines containing m/z and intensity values for the peaks in the spectrum.

>M2$Spot:57$Run:2$JobRunItem:540343$
3107.8000
70.0816 391.2
72.6717 90.5
74.9886 91.9
76.2054 83.1
77.4012 129.5
77.9631 88.5
78.0609 103.0
78.6257 83.9
79.0727 86.7
79.6941 82.3
80.7677 90.3
82.4509 92.0
83.7145 88.8
84.0984 376.8
86.1141 198.2
88.8027 96.3
91.4154 91.0
92.6776 83.9
94.6804 86.5
94.8214 101.6
95.5519 92.3
95.8454 91.3
96.0071 83.9
96.6495 88.6
96.8328 88.0
98.5437 84.9
99.4693 95.9
99.5805 106.9
100.3745 106.4
100.7082 107.3
101.0621 93.6
101.1240 87.1
101.3665 87.2
101.4247 118.5
102.8414 103.3
103.0088 90.2
103.1642 81.7
103.2301 89.8
106.2024 100.3
108.4485 88.2
109.4319 85.4
109.9773 97.8
110.0892 182.4
112.1061 284.7
113.9891 84.1
129.1256 485.2
131.0847 81.9
155.1166 118.4
157.1366 335.4
158.6527 83.7
165.1560 87.6
168.1984 98.3
172.1391 126.6
174.1659 144.4
180.1846 83.7
181.1877 83.5
181.8209 89.6
182.8433 90.0
183.1806 117.5
185.1501 120.4
185.6591 79.1
186.1617 244.4
187.1458 120.2
193.1293 80.8
193.6262 87.3
201.1599 104.0
208.1953 107.9
211.1799 111.1
212.8313 79.7
215.1714 79.2
216.0361 87.3
223.1858 128.5
225.2194 113.4
226.2059 81.5
233.1702 77.8
239.1640 148.0
245.9914 77.6
248.1163 84.2
251.1915 317.1
252.1939 94.8
254.1845 115.1
256.3840 91.9
258.1999 82.2
265.2201 150.8
266.7413 82.8
267.2198 105.1
268.2281 739.1
272.2186 398.9
280.3919 85.9
283.2325 3151.8
285.2540 455.3
286.2549 132.1
288.1819 133.6
296.1924 319.6
298.2702 92.5
299.4575 80.9
299.9333 79.1
300.8977 86.9
302.2717 503.6
306.2146 80.5
319.1718 150.8
326.2861 138.1
327.2418 114.9
329.2429 82.9
331.5964 87.6
334.2427 82.2
336.2851 146.9
342.3468 80.2
345.2804 79.1
352.2661 200.4
353.2881 104.8
354.2908 726.0
357.2963 78.7
369.2757 162.0
370.2867 88.2
372.0383 77.8
373.2827 778.3
380.2877 96.2
381.2939 115.9
385.3689 117.0
387.2572 84.9
387.3638 95.6
395.3505 114.3
396.3374 671.1
398.3166 115.5
406.2528 115.4
408.3204 86.2
409.2921 247.2
412.3166 82.2
413.3773 1461.9
415.3443 158.2
416.3478 88.7
424.3186 152.3
425.3545 120.9
426.3145 121.3
430.3989 108.0
434.8144 82.2
439.3230 90.6
440.3459 118.4
441.3409 82.8
442.1048 91.3
442.4216 211.9
451.3571 82.2
452.3797 106.0
453.1074 87.6
453.3752 188.0
455.3479 118.8
456.3493 130.5
458.3658 150.7
465.3470 96.4
466.3655 100.1
467.3860 210.4
469.3611 98.5
470.4068 815.4
480.2912 99.8
481.3954 85.5
482.3743 95.3
483.3530 785.5
486.3856 337.4
488.4195 2321.1
490.4154 138.6
492.3656 88.5
496.3865 88.2
499.4370 450.4
500.4361 143.0
506.6184 93.0
509.4129 130.6
510.4031 303.5
522.3859 129.1
524.4190 194.0
526.4268 241.3
527.4391 1347.1
529.0041 98.9
534.4095 123.0
536.4167 241.0
537.4172 278.7
538.4261 226.2
540.3975 95.0
541.3989 98.0
542.3872 112.0
543.4651 890.4
545.3925 171.9
551.3619 100.4
552.4116 396.8
553.4452 819.9
568.4350 113.3
569.4544 371.1
570.4100 789.0
571.4689 1798.4
579.3866 104.5
588.4965 455.0
589.4883 1065.9
596.4574 255.5
597.4323 187.4
599.4660 106.9
611.5179 116.1
620.4871 181.2
621.4377 149.6
623.4743 604.4
635.5037 143.0
636.5139 194.7
638.4978 143.7
641.4813 455.6
646.5198 177.4
647.5313 194.2
649.4849 144.0
651.4838 394.8
656.4097 123.0
663.5172 863.6
664.5204 723.4
665.5067 404.3
667.5187 195.3
669.4771 1464.8
673.4802 153.9
674.4850 147.1
680.5444 2489.2
683.5039 177.3
690.5580 155.4
691.5173 1409.9
694.5269 144.1
707.5475 282.1
708.5494 6143.1
712.5009 157.1
713.5298 223.7
721.2798 139.2
724.5475 208.8
725.5112 263.4
726.5636 947.1
734.6071 171.5
736.5535 466.9
738.5679 227.7
740.5203 1109.9
754.5662 174.1
757.5857 153.5
761.5527 220.8
770.6374 159.3
776.6151 1436.6
779.6152 263.3
781.5583 478.5
782.5583 401.4
783.5634 183.6
786.6204 197.9
793.6479 3252.6
796.8444 169.7
804.6259 2003.3
809.6254 176.2
821.6472 4014.9
827.5793 664.6
834.7109 228.4
838.6566 294.7
839.6672 2152.4
850.6635 456.0
853.6368 313.8
861.6659 490.9
876.7091 184.6
877.6795 466.5
878.6816 2373.6
881.7071 266.5
887.6740 469.2
894.6989 1655.3
904.6887 1349.9
905.6763 1503.1
909.6752 302.8
912.6490 224.4
920.6944 763.7
922.7094 2766.5
937.7177 201.5
939.7296 434.3
940.7097 1861.2
948.7070 216.2
963.7479 185.5
964.7260 494.5
965.7428 573.8
967.7164 211.2
968.6846 238.3
973.6597 189.6
974.7150 470.4
979.7427 355.9
981.7464 1558.9
991.7321 1327.3
993.7258 681.1
1007.7363 515.1
1009.7472 2812.3
1022.7801 423.4
1027.7549 735.7
1035.7804 205.9
1037.7865 246.9
1041.7332 344.9
1048.8053 200.3
1053.8173 286.4
1059.8074 241.7
1060.8302 278.3
1077.8153 2053.9
1080.8195 403.3
1087.7915 419.6
1092.8397 244.8
1094.8405 3787.4
1097.4357 275.1
1104.8356 336.6
1105.8196 2057.4
1121.8464 348.3
1122.8386 3569.9
1132.8738 341.9
1134.8398 248.1
1135.7661 222.3
1140.8470 477.1
1148.8553 1776.0
1151.8623 860.7
1158.8309 266.9
1165.8761 2593.6
1175.9073 411.5
1176.8569 4346.4
1192.8870 394.6
1193.8795 4447.1
1210.9226 323.8
1217.9330 222.0
1218.9121 246.1
1234.9608 332.9
1243.9415 217.4
1244.9635 248.3
1245.9500 452.6
1247.5428 245.4
1247.9064 611.6
1261.9563 3538.6
1264.9368 1442.4
1271.9171 430.2
1278.9728 6956.6
1282.6344 225.3
1283.6104 206.8
1289.9479 7346.3
1304.9622 204.1
1306.9674 7842.4
1324.9724 1742.1
1359.9382 250.8
1377.0360 184.1
1387.0062 579.0
1404.0372 891.0
1417.0232 193.8
1422.0396 582.7
1425.9580 180.0
1433.0537 457.0
1434.0682 584.3
1444.0348 633.6
1452.2704 162.1
1461.0468 3089.6
1478.0739 1360.8
1481.9540 259.3
1572.1329 1785.7
1589.1396 2914.0
1607.1584 587.2
1615.1604 401.1
1632.1760 802.0
1643.1538 759.4
1660.1827 937.7
1678.2063 1093.8
1744.1943 432.5
1745.2018 956.0
1761.2327 2015.0
1772.1965 1427.0
1789.2192 14183.9
1794.1674 377.5
1802.0072 2733.8
1807.2106 621.2
1831.2717 313.2
1846.2040 297.1
1848.2517 908.0
1852.1343 280.0
1858.2681 490.3
1874.2438 407.2
1876.2451 780.0
1895.3079 293.5
1915.0902 376.7
1930.2916 668.9
1947.3232 1534.8
1950.2249 369.0
1958.3010 637.9
1975.3192 778.9
1986.1882 284.6
1994.2582 246.1
2001.3354 289.1
2018.3698 691.2
2029.3331 558.9
2046.3420 1040.6
2064.3562 608.2
2105.4048 367.8
2115.3901 210.1
2131.3364 342.5
2135.4883 193.2
2151.3501 278.8
2186.2456 242.8
2201.5425 358.2
2202.4258 379.3
2204.3252 221.3
2218.4795 744.5
2229.4578 276.8
2246.4536 462.3
2264.4387 551.3
2286.4614 176.8
2288.3088 203.0
2302.5608 209.7
2303.4902 604.6
2319.5139 447.8
2329.4463 290.9
2345.5098 272.4
2365.4600 370.4
2390.5286 216.9
2400.4087 301.6
2401.2710 247.3
2404.6440 175.5
2406.5107 355.1
2409.4185 210.0
2416.4778 259.5
2418.3235 235.8
2433.4561 230.8
2434.5229 458.2
2452.5398 325.5
2534.5911 507.1
2537.3821 1239.1
2545.5405 281.3
2562.5515 686.6
2580.5818 340.2
2604.6902 187.9
2615.7002 168.6
2620.5854 204.6
2631.5867 506.2
2637.6226 341.9
2638.5334 694.3
2639.4980 663.2
2648.6875 209.1
2663.5813 346.3
2665.5854 991.3
2683.5642 312.8
2695.4773 502.7
2724.6736 344.9
2752.6106 306.0
2823.5828 430.6
2888.8936 266.0
2907.0676 252.8
2927.9209 267.9
3023.1343 658.9
3049.8591 783.8
3067.5789 1017.0
3068.8738 351.3
3083.6628 420.5
3097.8208 5141.1
3108.1621 1818.8
3114.6418 22896.7
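
A centroid file in this layout is straightforward to read. The Python sketch below is based only on the two example scans: it treats the text between > and the first $ as the scan type, reads the remaining name:value items as integers, and takes a line holding a single number as the precursor m/z. The function name is hypothetical.

```python
def parse_centroid_text(text):
    """Return a list of scans, each a dict with its title items and peaks."""
    scans = []
    current = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        if line.startswith(">"):
            parts = [part for part in line[1:].split("$") if part]
            current = {"type": parts[0], "precursor_mz": None, "peaks": []}
            for item in parts[1:]:
                key, _, value = item.partition(":")
                current[key] = int(value)  # integer values assumed, as in the examples
            scans.append(current)
        elif current is not None:
            fields = line.split()
            if len(fields) == 1:
                # A lone number is the precursor m/z of an MSMS scan.
                current["precursor_mz"] = float(fields[0])
            else:
                current["peaks"].append((float(fields[0]), float(fields[1])))
    return scans
```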


Batch-Tag results are stored in XML files. Once started each search is given a unique 16 letter key. For example if the key was SGp8ChbZMNHZXEa0 then the results file name would be SGp8ChbZMNHZXEa0.xml. An example of the path to the file from the home user repository directory is t/p/tphrox9dgr/batchtag/results/2010_05. Here tphrox9dgr is the user's home directory (a 10 letter key). The t and p directories are formed from the first 2 letters of this home directory. 2010_05 is the year and month when the search started.
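
The location of a results file follows directly from the user key, the search key and the start date. A minimal Python sketch of that rule, with a hypothetical function name:

```python
def results_file_path(user_key, search_key, year, month):
    """Path of a results file relative to the user repository directory."""
    return "%s/%s/%s/batchtag/results/%04d_%02d/%s.xml" % (
        user_key[0], user_key[1], user_key, year, month, search_key)
```

For the example above, results_file_path("tphrox9dgr", "SGp8ChbZMNHZXEa0", 2010, 5) gives t/p/tphrox9dgr/batchtag/results/2010_05/SGp8ChbZMNHZXEa0.xml.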

Some searches also have an expectation value search associated with them. The results files for these have the same file format as a normal results file. However they are stored in a different place in the repository. An example of the path to an expectation value results file is t/p/tphrox9dgr/batchtag/project/2010_05. If the project file is called test.xml then the expectation results files will have names like test.exp.1.xml, test.exp.2.xml, etc. Note that they are stored in the same directory as the project file which may have a different month element to the directory path than the results files. This month element will reflect the date when the project was created.

Batch-Tag results files only represent a stage in the processing of a data set. Search Compare further processes these files to present the results in a manner suitable for a typical user.

An example results file from a 2 spectrum data set is shown below:

<?xml version="1.0" encoding="UTF-8"?>
<?Thu May 06 11:21:31 2010, ProteinProspector Version 5.5.1?>
<batchtag_report>
<parameters>
<allow_non_specific>at%200%20termini</allow_non_specific>
<const_mod>Carbamidomethyl%20%28C%29</const_mod>
<data_source>List%20of%20Files</data_source>
<database>SwissProt.2010.03.30</database>
<dna_frame_translation>3</dna_frame_translation>
<enzyme>Trypsin</enzyme>
<expect_calc_method>Linear%20Tail%20Fit</expect_calc_method>
<expect_coeff_file>test.exp.1</expect_coeff_file>
<fragment_masses_tolerance>300</fragment_masses_tolerance>
<fragment_masses_tolerance_units>ppm</fragment_masses_tolerance_units>
<full_pi_range>1</full_pi_range>
<high_pi>10.0</high_pi>
<input_filename>lastres</input_filename>
<input_program_name>msfit</input_program_name>
<instrument_name>ESI-Q-TOF</instrument_name>
<low_pi>3.0</low_pi>
<max_hits>9999999</max_hits>
<missed_cleavages>1</missed_cleavages>
<mod_c_term_type>Peptide</mod_c_term_type>
<mod_defect>0.00048</mod_defect>
<mod_end_nominal>100</mod_end_nominal>
<mod_n_term_type>Peptide</mod_n_term_type>
<mod_start_nominal>-100</mod_start_nominal>
<msms_full_mw_range>1</msms_full_mw_range>
<msms_max_modifications>2</msms_max_modifications>
<msms_max_reported_hits>5</msms_max_reported_hits>
<msms_mod_AA>Acetyl%20%28Protein%20N-term%29</msms_mod_AA>
<msms_mod_AA>Acetyl%2BOxidation%20%28Protein%20N-term%20M%29</msms_mod_AA>
<msms_mod_AA>Gln-%3Epyro-Glu%20%28N-term%20Q%29</msms_mod_AA>
<msms_mod_AA>Met-loss%20%28Protein%20N-term%20M%29</msms_mod_AA>
<msms_mod_AA>Met-loss%2BAcetyl%20%28Protein%20N-term%20M%29</msms_mod_AA>
<msms_mod_AA>Oxidation%20%28M%29</msms_mod_AA>
<msms_parent_mass_systematic_error>0</msms_parent_mass_systematic_error>
<msms_parent_mass_tolerance>200</msms_parent_mass_tolerance>
<msms_parent_mass_tolerance_units>ppm</msms_parent_mass_tolerance_units>
<msms_precursor_charge_range>2%203</msms_precursor_charge_range>
<msms_prot_high_mass>125000</msms_prot_high_mass>
<msms_prot_low_mass>1000</msms_prot_low_mass>
<parent_mass_convert>monoisotopic</parent_mass_convert>
<report_title>BatchTag</report_title>
<script_filename>script</script_filename>
<search_key>GmI76WjMuhVloLnK</search_key>
<search_name>batchtag</search_name>
<species>All</species>
<use_instrument_ion_types>1</use_instrument_ion_types>
<version>5.5.1</version>
</parameters>
<pre_search_results>
<num_entries>432448</num_entries>
<num_final_indicies>432448</num_final_indicies>
</pre_search_results>
<d>
1-84.187-1-1-2881,2	39	642.2891	3	100
0	0	11299
1	13	-73.882	EQNPFVHEMLEALVIR	896	40.7	23.3	P40069
2	25	-54.982	Q(Gln->pyro-Glu)NKMTGIGEQYQYILR	210	19.3	1.8	Q9Z6S4
3	23	-47.057	IQYTSEQNM(Oxidation)ENALVIR	94	18.2	0.8	O73819
4	24	-79.718	MLYRTNYIGLVGGGASPR	67	18.1	0.7	Q75AQ4
5	25	-26.760	DQQM(Oxidation)GHTQAGPQRADLR	243	18.0	0.6	Q9I7C3
+243	A5VWC0
+243	B7V0N8
+242	P13456
+243	Q88BK1
+243	Q88RW7
+243	Q500U5
+243	Q3KKF9
+243	Q4KKS8
+243	Q48QJ8
+243	A6UX64
+243	Q1IH46
+243	B1J3Y4
+243	B0KEV1
</d>
<d>
1-86.237-1-1-2985,2:2992,2	40	901.3781	3	100
0	0	7300
1	9	-64.680	SDAFAEFAEPLVNSAYEAIKTDSAR	565	46.6	32.8	P40069
2	22	-22.435	WEFC(Carbamidomethyl)FDGEPYFILC(Carbamidomethyl)ATPGHEAR	142	16.1	2.3	P94382
3	25	-91.366	LPVLLTPGVTTC(Carbamidomethyl)AGDATSSAAASDLSAR	383	15.5	1.7	Q3BTK6
4	22	-79.391	EILAQYGLHETDTGSPEAQVAMLTK	10	14.8	1.0	B1MD64
5	21	-43.156	M(Met-loss)GSELESAM(Oxidation)ETLINVFHAHSGQEGDK	1	14.7	0.8	P56565
</d>
</batchtag_report>

The file consists of 3 sections:

1). A parameters section containing the settings from the search form as name-value pairs and one or two other settings such as the search key and the software version number.

2). A pre-search results section detailing the number of database entries selected by each phase of the pre-search.

3). The results and information for each spectrum in the data set.
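
The three sections can be separated with a standard XML parser. The Python sketch below uses the standard library. It assumes that parameter values are percent-encoded, which is inferred from values such as Carbamidomethyl%20%28C%29 in the example rather than stated anywhere, and it keeps repeated parameters such as msms_mod_AA as lists. The function name is hypothetical.

```python
import xml.etree.ElementTree as ET
from urllib.parse import unquote


def load_batchtag_report(xml_text):
    """Return (parameters, pre_search_results, spectra) from a results file."""
    root = ET.fromstring(xml_text)

    parameters = {}
    for element in root.find("parameters"):
        # Every value is stored in a list because names can repeat.
        parameters.setdefault(element.tag, []).append(unquote(element.text or ""))

    pre_search = {element.tag: int(element.text)
                  for element in root.find("pre_search_results")}

    # One list of tab-separated lines per <d> block; tabs are left untouched.
    spectra = [[line for line in (block.text or "").split("\n") if line.strip()]
               for block in root.findall("d")]
    return parameters, pre_search, spectra
```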

The results and information for each spectrum in the data set is contained between <d> tags. For example:

<d>
1-86.237-1-1-2985,2:2992,2	40	901.3781	3	100
0	0	7300
1	9	-64.680	SDAFAEFAEPLVNSAYEAIKTDSAR	565	46.6	32.8	P40069
2	22	-22.435	WEFC(Carbamidomethyl)FDGEPYFILC(Carbamidomethyl)ATPGHEAR	142	16.1	2.3	P94382
3	25	-91.366	LPVLLTPGVTTC(Carbamidomethyl)AGDATSSAAASDLSAR	383	15.5	1.7	Q3BTK6
4	22	-79.391	EILAQYGLHETDTGSPEAQVAMLTK	10	14.8	1.0	B1MD64
5	21	-43.156	M(Met-loss)GSELESAM(Oxidation)ETLINVFHAHSGQEGDK	1	14.7	0.8	P56565
</d>

The first line contains the spectrum information:

1-86.237-1-1-2985,2:2992,2	40	901.3781	3	100

The space characters in this line are tabs.

The first group (1-86.237-1-1-2985,2:2992,2) may be decoded as follows:

Fraction Number-Spot Number-Run Number-Spectrum Number-Optional Raw Data Field

The fraction number (here 1) is the numerical index of the centroid file within the project file.

The spot number (here 86.237) identifies the position of the spectrum within the fraction. It is typically either a retention time or spot number.

The run number (here 1) is usually 1. It may be higher on a TOFTOF instrument if several runs are done on the same spot.

The spectrum number (here 1) has a different meaning dependent on the instrument. On a TOFTOF it is a counter for different spectra on the same spot. On other instruments it is generally set to 1. If multiple spectra have the same recorded retention time then they will each be given a different run number (1, 2, 3, 4, etc). If the precursor charge is ambiguous then 10000 multiplied by the charge is added to the original run number. Each separate precursor charge will have its own set of results between <d> tags. For example if the Precursor Charge Range is set to 2 3 4 the run numbers corresponding to each charge state would be 20001, 30001 and 40001.
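
The charge offset is simple arithmetic: 10000 multiplied by the charge is added to the original number. The Python sketch below encodes and decodes it, assuming the original number is always below 10000; the function names are hypothetical.

```python
def encode_ambiguous_charge(number, charge):
    """Add 10000 times the charge to the original number."""
    return 10000 * charge + number


def decode_ambiguous_charge(encoded):
    """Return (original number, charge); charge is None when no offset was added."""
    if encoded < 10000:
        return encoded, None
    return encoded % 10000, encoded // 10000
```

For an original number of 1 and charges 2, 3 and 4 this reproduces the values 20001, 30001 and 40001 quoted above.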

The optional raw data field (here 2985,2:2992,2) is used to store information to allow the raw data to be extracted. The example given is from the ABI Q-STAR where a list of spectra to be summed to give the MSMS raw data scan is listed. The individual spectra are separated by colons (:). For each spectrum the cycle number is separated from the experiment number by a comma. For the ABI TOFTOF the job run item id for the MSMS spectrum is output here.

The next number on the spectrum information line is the number of peaks used for the search (40 in this case).

This is followed by the precursor m/z (901.3781), the precursor charge (3) and the precursor intensity (100).
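
Putting the pieces of the spectrum information line together, a parser might look like the following Python sketch. It assumes the line is tab-separated with exactly five fields and keeps the spot number as text because it may be either a retention time or a spot number. The function names are hypothetical, and the raw data decoder covers only the Q-STAR style list of cycle,experiment pairs.

```python
def parse_spectrum_line(line):
    """Decode the first line of a <d> block."""
    identifier, num_peaks, precursor_mz, charge, intensity = line.rstrip("\n").split("\t")
    # At most five dash-separated parts; the last (raw data) part is optional.
    parts = identifier.split("-", 4)
    return {
        "fraction": int(parts[0]),
        "spot": parts[1],
        "run": int(parts[2]),
        "spectrum": int(parts[3]),
        "raw_field": parts[4] if len(parts) == 5 else None,
        "num_peaks": int(num_peaks),
        "precursor_mz": float(precursor_mz),
        "precursor_charge": int(charge),
        "precursor_intensity": float(intensity),
    }


def parse_qstar_raw_field(raw_field):
    """Turn 2985,2:2992,2 into [(cycle, experiment), ...]."""
    return [tuple(int(number) for number in item.split(","))
            for item in raw_field.split(":")]
```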

The second line contains some numbers for calculating the expectation value. This line is not present if the Expectation Calc Method is set to None.

0	0	7300

The first 2 numbers are the gradient and offset of a linear fit to the tail of the score distribution. These numbers will be zero for a normal search. They are only set for an expectation value search.

The third number (7300) is the number of peptides which get through the precursor mass filter.

The following lines then contain information about the hits in rank order.

1	9	-64.680	SDAFAEFAEPLVNSAYEAIKTDSAR	565	46.6	32.8	P40069

The number 1 is the rank number of the hit.

The number 9 is the number of unmatched peaks.

The number -64.680 is the delta mass (in the specified tolerance units).

The peptide sequence SDAFAEFAEPLVNSAYEAIKTDSAR is the hit peptide sequence. It can be preceded by an optional N-terminal modification or followed by an optional C-terminal modification and/or an optional neutral loss modification. Examples showing an N-terminal modification, a C-terminal modification and a neutral loss modification are shown below. The N-terminal and C-terminal modifications are preceded by a dash (-) character. The neutral loss modification is preceded by a plus (+) character.

-Acetyl	MEEVVIAGMSGK
EITALAPSTMK	-Label:18O(2)
VTVSLWQGETQVASGTAPFGGEIIDER	+Cation:N(1)H(4)

The number 565 is the start amino acid of the peptide within the hit protein.

The number 46.6 is the peptide score for the given peptide hit.

The number 32.8 is the score difference for the given peptide hit. This is the score difference between this peptide and the next best scoring peptide.

The next field P40069 contains the accession number. Note if multiple databases are searched then the accession number will be written as say 2$P40069 where 2 is the database index number.
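
The hit line fields described above could be read as in the following Python sketch. It assumes the fields are tab-separated and that terminal and neutral loss modifications, when present, occupy their own tab-separated fields next to the sequence; this is inferred from the short examples and should be checked against a real results file. The function name is hypothetical.

```python
def parse_hit_line(line):
    """Decode one ranked hit line from a <d> block."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 8:
        raise ValueError("too few fields in hit line: %r" % line)
    start, score, score_difference, accession = fields[-4:]
    hit = {
        "rank": int(fields[0]),
        "unmatched_peaks": int(fields[1]),
        "delta_mass": float(fields[2]),
        "n_term": None, "sequence": None, "c_term": None, "neutral_loss": None,
        "start": int(start),
        "score": float(score),
        "score_difference": float(score_difference),
        "database_index": None,
    }
    # Fields between the delta mass and the start residue: sequence plus modifications.
    for item in fields[3:-4]:
        if item.startswith("+"):
            hit["neutral_loss"] = item[1:]
        elif item.startswith("-"):
            # A dash item before the sequence is N-terminal, after it C-terminal.
            hit["c_term" if hit["sequence"] else "n_term"] = item[1:]
        else:
            hit["sequence"] = item
    # With multiple databases the accession is written as index$accession.
    if "$" in accession:
        index, accession = accession.split("$", 1)
        hit["database_index"] = int(index)
    hit["accession"] = accession
    return hit
```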

If the hit peptide sequence occurs several times in the database then the result is written as shown below:

5	25	-26.760	DQQM(Oxidation)GHTQAGPQRADLR	243	18.0	0.6	Q9I7C3
+243	A5VWC0
+243	B7V0N8
+242	P13456
+243	Q88BK1
+243	Q88RW7
+243	Q500U5
+243	Q3KKF9
+243	Q4KKS8
+243	Q48QJ8
+243	A6UX64
+243	Q1IH46
+243	B1J3Y4
+243	B0KEV1

After the first occurrence is listed, the lines for the other occurrences start with a + sign. The two fields are then the start amino acid and the accession number (eg 243 and A5VWC0).
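
When reading a results file, the + lines need to be attached to the ranked hit that precedes them. A minimal Python sketch of that grouping, assuming tab-separated fields and a hypothetical function name:

```python
def group_occurrences(lines):
    """Pair each hit line with its extra (start, accession) occurrences."""
    groups = []
    for line in lines:
        if line.startswith("+"):
            if not groups:
                raise ValueError("occurrence line without a preceding hit")
            start, accession = line[1:].split("\t")
            groups[-1][1].append((int(start), accession))
        else:
            groups.append((line, []))
    return groups
```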

If peptide hits have the same score then their ranks will be the same. Eg:

3	17	-15.196	LTRYSQGDDDGS(Phospho)SSSGGSSVAGSQSTLFK	8	41.0	0.0	Q9UH99
3	17	-15.196	LTRYSQGDDDGSS(Phospho)SSGGSSVAGSQSTLFK	8	41.0	0.0	Q9UH99
3	17	-15.196	LTRYSQGDDDGSSS(Phospho)SGGSSVAGSQSTLFK	8	41.0	0.0	Q9UH99
3	17	-15.196	LTRYSQGDDDGSSSS(Phospho)GGSSVAGSQSTLFK	8	41.0	0.0	Q9UH99

Versions after 5.8.0 incorporate SLIP (Site Localization in Peptide) scoring. To accommodate this, extra hits relating to alternative modification positions are stored for each spectrum when necessary. For example:

<d>
1-557-1-1	40	1062.8097	3	17602
0	0	190
1	30	-38.767	DSTFNVFVGKGQLITGMDQALVGMC(Carbamidomethyl)VNER	78	9.3	2.1	O95302
2	28	-47.865	LVEKQNPAEGLQTLGAQM(Oxidation)QGGFGC(Carbamidomethyl)GNQLPK	3810	9.2	2.0	Q8NEZ4
3	26	-48.708	MPSLPQEGVIQGPSPLDLNTELPYQSTM(Oxidation)K	1	7.8	0.6	Q6ZT98
4	30	-13.191	ELNKQESASDMTSTFPVAQSLTPGSM(Oxidation)EER	702	7.3	0.1	Q8N0Z3
5	30	-25.029	DLQANDTGRYFC(Carbamidomethyl)LAANDQNNVTIM(Oxidation)ANLK	486	7.3	0.1	P32004
M	35	M(Oxidation)PSLPQEGVIQGPSPLDLNTELPYQSTMK	2.5
M	31	ELNKQESASDM(Oxidation)TSTFPVAQSLTPGSMEER	5.0
</d>

The extra hits have an M in place of the rank number. In this example the 3rd and 4th hits have extra hits listed corresponding to different positions for the Oxidation modification. The second field for these extra hits corresponds to the number of unmatched peaks. The third field is the peptide sequence and the fourth field the peptide score. There could be more than one of these extra hits for a given ranked hit if there are multiple ambiguous modifications in the peptide.
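The sketch below shows one way the M lines could be attached to the ranked hits they belong to. It rests on two assumptions that the text does not confirm: an extra hit is paired with the ranked hit that has the same sequence once modification names are removed, and fields are tab separated. The function names are hypothetical.

```python
def strip_modifications(peptide):
    """Remove parenthesised modification names, allowing nested brackets."""
    bare, depth = [], 0
    for ch in peptide:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
        elif depth == 0:
            bare.append(ch)
    return "".join(bare)


def group_extra_hits(hit_lines):
    """Map each rank to its extra hits as (unmatched_peaks, peptide, score).

    Expects the hit lines of a single spectrum, with the M lines listed
    after the ranked lines as in the example above.
    """
    rank_by_sequence = {}
    extras_by_rank = {}
    for line in hit_lines:
        fields = line.split("\t")
        if fields[0] == "M":
            # ASSUMPTION: pair by unmodified sequence. Unpaired extra hits
            # end up under the key None.
            rank = rank_by_sequence.get(strip_modifications(fields[2]))
            extra = (int(fields[1]), fields[2], float(fields[3]))
            extras_by_rank.setdefault(rank, []).append(extra)
        elif fields[0].isdigit() and len(fields) >= 8:
            bare = strip_modifications(fields[3])
            rank_by_sequence.setdefault(bare, int(fields[0]))
    return extras_by_rank


hits = [
    "3\t26\t-48.708\tMPSLPQEGVIQGPSPLDLNTELPYQSTM(Oxidation)K\t1\t7.8\t0.6\tQ6ZT98",
    "4\t30\t-13.191\tELNKQESASDMTSTFPVAQSLTPGSM(Oxidation)EER\t702\t7.3\t0.1\tQ8N0Z3",
    "M\t35\tM(Oxidation)PSLPQEGVIQGPSPLDLNTELPYQSTMK\t2.5",
    "M\t31\tELNKQESASDM(Oxidation)TSTFPVAQSLTPGSMEER\t5.0",
]
extras = group_extra_hits(hits)
```

With these four lines the extra hits are grouped under ranks 3 and 4, which matches the description above.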


When you do a Batch-Tag search, one of the requirements is a set of MS/MS data to search. This is called a project and consists of a set of peak list files and, optionally, their associated raw data files. Multiple Batch-Tag searches can be done on the same project. The information about a project is stored in a project file, which is an XML file.

Projects Created By Batch-Tag Web

If Batch-Tag Web is used, the data for the search is uploaded by the search form and stored in the user repository. An example of a path to the data is t/p/tphrox9dgr/batchtag/data/2010_05/G65vkVMBukoVVd57. Here tphrox9dgr is the user's home directory (a 10-character key). The t and p directories are formed from the first 2 characters of this home directory. 2010_05 is the year and month when the project was created. G65vkVMBukoVVd57 is the 16-character search key used by the first search performed on the data. The peak list files and, optionally, the raw data files are stored in this directory. All peak lists are converted to mgf format.
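The layout of the data directory can be reproduced with a few lines of code. This is a sketch of the path structure described above only. The function name data_directory is hypothetical, and the result is relative to the root of the user repository.

```python
def data_directory(home_key, year, month, search_key):
    """Build the repository-relative data directory for an uploaded project."""
    parts = [
        home_key[0],               # first character of the home directory
        home_key[1],               # second character of the home directory
        home_key,
        "batchtag",
        "data",
        f"{year:04d}_{month:02d}",  # year and month the project was created
        search_key,                # key of the first search on the data
    ]
    return "/".join(parts)


path = data_directory("tphrox9dgr", 2010, 5, "G65vkVMBukoVVd57")
```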

The project file for this data would be stored in the t/p/tphrox9dgr/batchtag/project/2010_05 directory. An example name for the file is project1.xml, where project1 is the name of the project. An example file is shown below:

<?xml version="1.0" encoding="UTF-8"?>
<?Tue Aug 16 13:24:01 2011, ProteinProspector Version 5.10.0?>
<project>
<project_name>project1</project_name>
<file>
<centroid>#t/p/tphrox9dgr/batchtag/data/2011_08/lYGk4as5uMrMcoE1/V20110430-26.mgf</centroid>
<num_msms_spectra>205</num_msms_spectra>
<centroid_name>V20110430-26</centroid_name>
<raw>#t/p/tphrox9dgr/batchtag/data/2011_08/lYGk4as5uMrMcoE1/V20110430-26.RAW</raw>
</file>
<file>
<centroid>#t/p/tphrox9dgr/batchtag/data/2011_08/lYGk4as5uMrMcoE1/V20110430-27.mgf</centroid>
<num_msms_spectra>86</num_msms_spectra>
<centroid_name>V20110430-27</centroid_name>
<raw>#t/p/tphrox9dgr/batchtag/data/2011_08/lYGk4as5uMrMcoE1/V20110430-27.RAW</raw>
</file>
<file>
<centroid>#t/p/tphrox9dgr/batchtag/data/2011_08/lYGk4as5uMrMcoE1/V20110430-28.mgf</centroid>
<num_msms_spectra>279</num_msms_spectra>
<centroid_name>V20110430-28</centroid_name>
<raw>#t/p/tphrox9dgr/batchtag/data/2011_08/lYGk4as5uMrMcoE1/V20110430-28.RAW</raw>
</file>
<file>
<centroid>#t/p/tphrox9dgr/batchtag/data/2011_08/lYGk4as5uMrMcoE1/V20110430-29.mgf</centroid>
<num_msms_spectra>291</num_msms_spectra>
<centroid_name>V20110430-29</centroid_name>
<raw>#t/p/tphrox9dgr/batchtag/data/2011_08/lYGk4as5uMrMcoE1/V20110430-29.RAW</raw>
</file>
<file>
<centroid>#t/p/tphrox9dgr/batchtag/data/2011_08/lYGk4as5uMrMcoE1/V20110430-30.mgf</centroid>
<num_msms_spectra>386</num_msms_spectra>
<centroid_name>V20110430-30</centroid_name>
<raw>#t/p/tphrox9dgr/batchtag/data/2011_08/lYGk4as5uMrMcoE1/V20110430-30.RAW</raw>
</file>
<file>
<centroid>#t/p/tphrox9dgr/batchtag/data/2011_08/lYGk4as5uMrMcoE1/V20110430-31.mgf</centroid>
<num_msms_spectra>380</num_msms_spectra>
<centroid_name>V20110430-31</centroid_name>
<raw>#t/p/tphrox9dgr/batchtag/data/2011_08/lYGk4as5uMrMcoE1/V20110430-31.RAW</raw>
</file>
</project>

The information on a particular file is contained in a <file> block. The <centroid> block gives the path to the peak list file. The # sign at the start of the path signifies the user repository, which allows the user repository to be moved without affecting the project file. The <num_msms_spectra> block stores the number of MS/MS spectra in the file. This is important because, by default, searches are split into blocks of 500 spectra and a block can span multiple files. The <centroid_name> block is the name used to refer to a particular fraction in the results. This will generally correspond to the file name but doesn't have to. The <raw> block is optional and gives the path to the raw data file. Note that it is possible to manually edit a project file to add raw data to a project after the project has been created.
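To make the role of <num_msms_spectra> concrete, the sketch below reads a shortened version of the project file above with Python's standard xml.etree.ElementTree module and cuts the spectra into blocks of 500. It assumes that the files are taken in the order listed and that a block simply continues into the next file. The text does not confirm how Batch-Tag orders the spectra, and the function names are hypothetical.

```python
import xml.etree.ElementTree as ET

PROJECT_XML = b"""<?xml version="1.0" encoding="UTF-8"?>
<?Tue Aug 16 13:24:01 2011, ProteinProspector Version 5.10.0?>
<project>
<project_name>project1</project_name>
<file>
<centroid>#t/p/tphrox9dgr/batchtag/data/2011_08/lYGk4as5uMrMcoE1/V20110430-26.mgf</centroid>
<num_msms_spectra>205</num_msms_spectra>
<centroid_name>V20110430-26</centroid_name>
</file>
<file>
<centroid>#t/p/tphrox9dgr/batchtag/data/2011_08/lYGk4as5uMrMcoE1/V20110430-27.mgf</centroid>
<num_msms_spectra>86</num_msms_spectra>
<centroid_name>V20110430-27</centroid_name>
</file>
<file>
<centroid>#t/p/tphrox9dgr/batchtag/data/2011_08/lYGk4as5uMrMcoE1/V20110430-28.mgf</centroid>
<num_msms_spectra>279</num_msms_spectra>
<centroid_name>V20110430-28</centroid_name>
</file>
</project>
"""


def read_project(xml_bytes):
    """Return the project name and a list of per-file dictionaries."""
    root = ET.fromstring(xml_bytes)
    files = []
    for node in root.findall("file"):
        files.append({
            "centroid": node.findtext("centroid"),
            "name": node.findtext("centroid_name"),
            "num_spectra": int(node.findtext("num_msms_spectra")),
            "raw": node.findtext("raw"),  # None when <raw> is absent
        })
    return root.findtext("project_name"), files


def split_into_blocks(files, block_size=500):
    """Cut the spectra into blocks of (name, first, last) index ranges."""
    blocks, current, room = [], [], block_size
    for entry in files:
        start, remaining = 1, entry["num_spectra"]
        while remaining > 0:
            take = min(room, remaining)
            current.append((entry["name"], start, start + take - 1))
            start += take
            remaining -= take
            room -= take
            if room == 0:
                blocks.append(current)
                current, room = [], block_size
    if current:
        blocks.append(current)
    return blocks


name, files = read_project(PROJECT_XML)
blocks = split_into_blocks(files)
```

In this sketch the first block ends partway through the third file, which is why the spectrum count of every file has to be known before the search starts.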

Projects Created By Make Project

If the search is done on data in the data repository using Batch-Tag rather than Batch-Tag Web, the files for the project are specified by the Make Project form. In this case the <centroid> block starts with a $ character to specify the data repository. The project file is still created in the user repository. An example is given below:

<?xml version="1.0" encoding="UTF-8"?>
<?Tue Aug 16 13:24:01 2011, ProteinProspector Version 5.10.0?>
<project>
<project_name>project2</project_name>
<file>
<centroid>$QStar1/2011/05/file1.mgf</centroid>
<num_msms_spectra>681</num_msms_spectra>
<centroid_name>file1</centroid_name>
<raw>$QStar1/2011/05/file1.wiff</raw>
</file>
<file>
<centroid>$QStar1/2011/05/file2.mgf</centroid>
<num_msms_spectra>942</num_msms_spectra>
<centroid_name>file2</centroid_name>
<raw>$QStar1/2011/05/file2.wiff</raw>
</file>
<file>
<centroid>$QStar1/2011/05/file3.mgf</centroid>
<num_msms_spectra>663</num_msms_spectra>
<centroid_name>file3</centroid_name>
<raw>$QStar1/2011/05/file3.wiff</raw>
</file>
<file>
<centroid>$QStar1/2011/05/file4.mgf</centroid>
<num_msms_spectra>711</num_msms_spectra>
<centroid_name>file4</centroid_name>
<raw>$QStar1/2011/05/file4.wiff</raw>
</file>
</project>
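The # and $ prefixes can be thought of as placeholders for the two repository roots. The sketch below expands them to full paths. The root directories /repo/user and /repo/data are invented for the example, and the function name resolve_path is hypothetical. The real locations depend on how the server is configured.

```python
from pathlib import PurePosixPath


def resolve_path(path, user_repository, data_repository):
    """Expand a '#' or '$' prefixed project path to a full path."""
    if path.startswith("#"):
        # '#' signifies the user repository.
        return str(PurePosixPath(user_repository) / path[1:])
    if path.startswith("$"):
        # '$' signifies the data repository.
        return str(PurePosixPath(data_repository) / path[1:])
    # No prefix: the path is already a full path.
    return path


in_data = resolve_path("$QStar1/2011/05/file1.mgf", "/repo/user", "/repo/data")
in_user = resolve_path("#t/p/tphrox9dgr/batchtag/data/2011_08/x.mgf", "/repo/user", "/repo/data")
```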

Project Files when Running Batch-Tag from the Command Line

A further option is to run Batch-Tag from the command line. In this case the path to the project file is specified as a parameter passed to the program. A typical file is shown below:

<?xml version="1.0" encoding="UTF-8"?>
<?Tue Aug 16 13:24:01 2011, ProteinProspector Version 5.10.0?>
<project>
<project_name>F1</project_name>
<file>
<centroid>/home/user/x/F1.mgf</centroid>
<num_msms_spectra>271</num_msms_spectra>
<centroid_name>F1</centroid_name>
</file>
</project>

In this case the full path to the centroid file is given. It is also the user's responsibility to calculate the number of spectra in each file.
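One way to obtain the spectrum count for an mgf file is to count its BEGIN IONS lines, since each spectrum in that format is enclosed between BEGIN IONS and END IONS. The sketch below does this and then builds a minimal project file. The function names are hypothetical, the timestamp line of the examples above is not reproduced, and whether Batch-Tag requires that line has not been checked here.

```python
import xml.etree.ElementTree as ET


def count_mgf_spectra(lines):
    """Count the spectra in an mgf peak list given as an iterable of lines."""
    return sum(1 for line in lines if line.strip() == "BEGIN IONS")


def build_project_xml(project_name, centroid_path, num_spectra):
    """Build a single-file project document as a string."""
    project = ET.Element("project")
    ET.SubElement(project, "project_name").text = project_name
    file_node = ET.SubElement(project, "file")
    ET.SubElement(file_node, "centroid").text = centroid_path
    ET.SubElement(file_node, "num_msms_spectra").text = str(num_spectra)
    ET.SubElement(file_node, "centroid_name").text = project_name
    return ET.tostring(project, encoding="unicode")


# A tiny in-memory peak list standing in for /home/user/x/F1.mgf.
mgf_lines = [
    "BEGIN IONS",
    "PEPMASS=1062.8097",
    "CHARGE=3+",
    "175.119 12.0",
    "END IONS",
    "BEGIN IONS",
    "PEPMASS=650.3210",
    "CHARGE=2+",
    "147.113 8.0",
    "END IONS",
]
spectra = count_mgf_spectra(mgf_lines)
project_xml = build_project_xml("F1", "/home/user/x/F1.mgf", spectra)
```

For a real file the lines can be supplied by opening it, for example with open("/home/user/x/F1.mgf") as handle, and passing the handle to count_mgf_spectra.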

Calibrated Projects

Projects can also be calibrated so that a systematic offset is applied to all precursor m/z values in a peak list file. The units for the offset are specified in a <msms_parent_mass_tolerance_units> block near the top of the file. An <offset> block can then be specified for each file. If you calibrate a project from the Search Compare form, a new project and project file are created. If the original project was called project1, the new project will be called project1.cal.1. Any subsequent calibrations done on the original project will have names such as project1.cal.2, etc. A calibrated project cannot be calibrated a second time. If you wish to use a calibrated project from the command line, the project name and file need to have a .cal.1 suffix. An example of a calibrated project file is given below:

<?xml version="1.0" encoding="UTF-8"?>
<?Tue Aug 16 13:24:01 2011, ProteinProspector Version 5.10.0?>
<project>
<project_name>nucleoporin.cal.1</project_name>
<msms_parent_mass_tolerance_units>ppm</msms_parent_mass_tolerance_units>
<file>
<centroid>$others/F15uLUCSF.mgf</centroid>
<num_msms_spectra>396</num_msms_spectra>
<centroid_name>F15uLUCSF</centroid_name>
<raw>$others/F15uLUCSF.wiff</raw>
<offset>-8.846</offset>
</file>
<file>
<centroid>$others/F25uLUCSF.mgf</centroid>
<num_msms_spectra>851</num_msms_spectra>
<centroid_name>F25uLUCSF</centroid_name>
<raw>$others/F25uLUCSF.wiff</raw>
<offset>-17.95</offset>
</file>
<file>
<centroid>$others/F35uLUCSF.mgf</centroid>
<num_msms_spectra>519</num_msms_spectra>
<centroid_name>F35uLUCSF</centroid_name>
<raw>$others/F35uLUCSF.wiff</raw>
<offset>-24.34</offset>
</file>
<file>
<centroid>$others/F45uLUCSF.mgf</centroid>
<num_msms_spectra>951</num_msms_spectra>
<centroid_name>F45uLUCSF</centroid_name>
<raw>$others/F45uLUCSF.wiff</raw>
<offset>-62.38</offset>
</file>
<file>
<centroid>$others/F55uLUCSF.mgf</centroid>
<num_msms_spectra>765</num_msms_spectra>
<centroid_name>F55uLUCSF</centroid_name>
<raw>$others/F55uLUCSF.wiff</raw>
<offset>-40.00</offset>
</file>
<file>
<centroid>$others/F65uLUCSF.mgf</centroid>
<num_msms_spectra>564</num_msms_spectra>
<centroid_name>F65uLUCSF</centroid_name>
<raw>$others/F65uLUCSF.wiff</raw>
<offset>-48.11</offset>
</file>
</project>
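As a rough illustration of what a ppm offset means for a single precursor, the sketch below shifts an m/z value by a per-file offset taken from the example above. It assumes that the offset is added to the measured m/z. The text does not state the sign convention used by Protein Prospector, so the real behaviour may be the reverse. The function name apply_ppm_offset is hypothetical.

```python
def apply_ppm_offset(mz, offset_ppm):
    """Shift an m/z value by an offset given in parts per million."""
    # ASSUMPTION: the offset is added to the measured m/z. The sign
    # convention is not stated in the text and may be the opposite.
    return mz + mz * offset_ppm / 1e6


# Offsets copied from the <offset> blocks of the example project file.
offsets = {"F15uLUCSF": -8.846, "F25uLUCSF": -17.95}
shifted = apply_ppm_offset(1000.0, offsets["F15uLUCSF"])
```

At m/z 1000 an offset of -8.846 ppm corresponds to a shift of about 0.0088 in m/z, and the shift scales with m/z because the units are ppm.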