matchPattern {Biostrings}R Documentation

String searching functions

Description

A set of functions for finding all the occurences (aka "matches" or "hits") of a given pattern (typically short) in a (typically long) reference sequence or set of reference sequences (aka the subject)

Usage

  matchPattern(pattern, subject, algorithm="auto",
               max.mismatch=0, with.indels=FALSE, fixed=TRUE)
  countPattern(pattern, subject, algorithm="auto",
               max.mismatch=0, with.indels=FALSE, fixed=TRUE)
  vmatchPattern(pattern, subject, algorithm="auto",
                max.mismatch=0, with.indels=FALSE, fixed=TRUE)
  vcountPattern(pattern, subject, algorithm="auto",
                max.mismatch=0, with.indels=FALSE, fixed=TRUE)

Arguments

pattern The pattern string.
subject An XString, XStringViews or MaskedXString object for matchPattern and countPattern.
An XStringSet or XStringViews object for vmatchPattern and vcountPattern.
algorithm One of the following: "auto", "naive-exact", "naive-inexact", "boyer-moore", "shift-or" or "indels".
max.mismatch The maximum number of mismatching letters allowed (see isMatchingAt for the details). If non-zero, an inexact matching algorithm is used.
with.indels If TRUE then indels are allowed. In that case max.mismatch is interpreted as the maximum "edit distance" allowed between the pattern and a match. Note that in order to avoid pollution by redundant matches, only the "best local matches" are returned. Roughly speaking, a "best local match" is a match that is locally both the closest (to the pattern P) and the shortest. More precisely, a substring S' of the subject S is a "best local match" iff:
       (a) nedit(P, S') <= max.mismatch
       (b) for every substring S1 of S':
               nedit(P, S1) > nedit(P, S')
       (c) for every substring S2 of S that contains S':
               nedit(P, S2) <= nedit(P, S')
One nice property of "best local matches" is that their first and last letters are guaranteed to be aligned with letters in P (i.e. they match letters in P).
fixed If FALSE then IUPAC extended letters are interpreted as ambiguities (see isMatchingAt for the details).

Details

Available algorithms are: ``naive exact'', ``naive inexact'', ``Boyer-Moore-like'', ``shift-or'' and ``indels''. Not all of them can be used in all situations: restrictions depend on the length of the pattern, the class of the subject, and the values of max.mismatch, with.indels and fixed. All those parameters form the search criteria.

Note that the choice of an algorithm is not part of the search criteria. This is because algorithms are interchangeable, that is, if 2 different algorithms are compatible with a given search criteria, then choosing one over the other will not affect the result (but will most likely affect the performance). So there is no "wrong choice" of algorithm (strictly speaking).

Using algorithm="auto" is recommended because then the fastest algorithm will automatically be picked up among the set of compatible algorithms (if there is more than one).

Value

An XStringViews object for matchPattern.
A single integer for countPattern.
An MIndex object for vmatchPattern.
An integer vector for vcountPattern, with each element in the vector corresponding to the number of matches in the corresponding element of subject.

Note

Use matchPDict if you need to match a (big) set of patterns against a reference sequence.

Use pairwiseAlignment if you need to solve a (Needleman-Wunsch) global alignment, a (Smith-Waterman) local alignment, or an (ends-free) overlap alignment problem.

See Also

matchPDict, pairwiseAlignment, isMatchingAt, mismatch, matchLRPatterns, matchProbePair, maskMotif, alphabetFrequency, XStringViews-class, MIndex-class

Examples

  ## ---------------------------------------------------------------------
  ## A. matchPattern()/countPattern()
  ## ---------------------------------------------------------------------

  ## A simple inexact matching example with a short subject:
  x <- DNAString("AAGCGCGATATG")
  m1 <- matchPattern("GCNNNAT", x)
  m1
  m2 <- matchPattern("GCNNNAT", x, fixed=FALSE)
  m2
  as.matrix(m2)

  ## With DNA sequence of yeast chromosome number 1:
  data(yeastSEQCHR1)
  yeast1 <- DNAString(yeastSEQCHR1)
  PpiI <- "GAACNNNNNCTC" # a restriction enzyme pattern
  match1.PpiI <- matchPattern(PpiI, yeast1, fixed=FALSE)
  match2.PpiI <- matchPattern(PpiI, yeast1, max.mismatch=1, fixed=FALSE)

  ## With a genome containing isolated Ns:
  library(BSgenome.Celegans.UCSC.ce2)
  chrII <- Celegans[["chrII"]]
  alphabetFrequency(chrII)
  matchPattern("N", chrII)
  matchPattern("TGGGTGTCTTT", chrII) # no match
  matchPattern("TGGGTGTCTTT", chrII, fixed=FALSE) # 1 match

  ## Using wildcards ("N") in the pattern on a genome containing N-blocks:
  library(BSgenome.Dmelanogaster.UCSC.dm3)
  chrX <- maskMotif(Dmelanogaster$chrX, "N")
  as(chrX, "XStringViews") # 4 non masked regions
  matchPattern("TTTATGNTTGGTA", chrX, fixed=FALSE)
  ## Can also be achieved with no mask:
  masks(chrX) <- NULL
  matchPattern("TTTATGNTTGGTA", chrX, fixed="subject")

  ## ---------------------------------------------------------------------
  ## B. vmatchPattern()/vcountPattern()
  ## ---------------------------------------------------------------------

  Ebox <- DNAString("CANNTG")
  subject <- Celegans$upstream5000
  mindex <- vmatchPattern(Ebox, subject, fixed=FALSE)
  count_index <- countIndex(mindex)  # Get the number of matches per
                                     # subject element.
  sum(count_index)  # Total number of matches.
  table(count_index)
  i0 <- which(count_index == max(count_index))
  subject[i0]  # The subject element with most matches.

  ## The matches in 'subject[i0]' as an IRanges object:
  mindex[[i0]]
  ## The matches in 'subject[i0]' as an XStringViews object:
  Views(subject[[i0]], mindex[[i0]])

  ## ---------------------------------------------------------------------
  ## C. WITH INDELS
  ## ---------------------------------------------------------------------
  library(BSgenome.Celegans.UCSC.ce2)
  pattern <- DNAString("ACGGACCTAATGTTATC")
  subject <- Celegans$chrI

  ## Allowing up to 2 mismatching letters doesn't give any match:
  matchPattern(pattern, subject, max.mismatch=2)

  ## But allowing up to 2 edit operations gives 3 matches:
  system.time(m <- matchPattern(pattern, subject, max.mismatch=2, with.indels=TRUE))
  m

  ## pairwiseAlignment() returns the (first) best match only:
  if (interactive()) {
    mat <- nucleotideSubstitutionMatrix(match=1, mismatch=0, baseOnly=TRUE)
    ## Note that this call to pairwiseAlignment() will need to
    ## allocate 733.5 Mb of memory (i.e. length(pattern) * length(subject)
    ## * 3 bytes).
    system.time(pwa <- pairwiseAlignment(pattern, subject, type="local",
                                         substitutionMatrix=mat,
                                         gapOpening=0, gapExtension=1))
    pwa
  }

  ## Only "best local matches" are reported:
    ## - with deletions in the subject
  subject <- BString("ACDEFxxxCDEFxxxABCE")
  matchPattern("ABCDEF", subject, max.mismatch=2, with.indels=TRUE)
  matchPattern("ABCDEF", subject, max.mismatch=2)
    ## - with insertions in the subject
  subject <- BString("AiBCDiEFxxxABCDiiFxxxAiBCDEFxxxABCiDEF")
  matchPattern("ABCDEF", subject, max.mismatch=2, with.indels=TRUE)
  matchPattern("ABCDEF", subject, max.mismatch=2)
    ## - with substitutions (note that the "best local matches" can introduce
    ##   indels and therefore be shorter than 6)
  subject <- BString("AsCDEFxxxABDCEFxxxBACDEFxxxABCEDF")
  matchPattern("ABCDEF", subject, max.mismatch=2, with.indels=TRUE)
  matchPattern("ABCDEF", subject, max.mismatch=2)

[Package Biostrings version 2.12.9 Index]