2015-04-15

算法

1 小时读完 (大约 7565 个字)

(转载)Boyer-Moore(字符串匹配)

引言
概念
代码
代码解释

引言

KMP它并不是效率最高的算法，实际采用并不多。各种文本编辑器的”查找”功能（Ctrl+F），大多采用Boyer-Moore算法。

Boyer-Moore算法不仅效率高，而且构思巧妙，容易理解。1977年，德克萨斯大学的Robert S. Boyer教授和J Strother Moore教授发明了这种算法。
在用于查找子字符串的算法当中，BM（Boyer-Moore）算法是目前被认为最高效的字符串搜索算法。一般情况下，比KMP算法快3-5倍。该算法常用于文本编辑器中的搜索匹配功能，比如大家所熟知的GNU grep命令使用的就是该算法，这也是GNU grep比BSD grep快的一个重要原因。

概念

Moore教授例子

根据Moore教授自己的例子来解释这种算法。

假定字符串（文本串）为”HERE IS A SIMPLE EXAMPLE”，搜索词（模式串）为”EXAMPLE”。

首先，”字符串”与”搜索词”头部对齐，从尾部开始比较。
这是一个很聪明的想法，因为如果尾部字符不匹配，那么只要一次比较，就可以知道前7个字符肯定不是要找的结果。
我们看到，”S”与”E”不匹配。这时，”S”就被称为”坏字符”（bad character），即不匹配的字符。我们还发现，”S”不包含在搜索词”EXAMPLE”之中，这意味着可以把搜索词直接移到”S”的后一位。

依然从尾部开始比较，发现”P”与”E”不匹配，所以”P”是”坏字符”。但是，”P”包含在搜索词”EXAMPLE”之中。所以，将搜索词后移两位，两个”P”对齐。

我们由此总结出”坏字符规则”：

1	后移位数 = 坏字符的位置（在搜索串中的位置） - 搜索词中的上一次出现位置（最新比较的位置）

如果”坏字符”不包含在搜索词之中，则上一次出现位置为 -1。
以”P”为例，它作为”坏字符”，出现在搜索词的第6位（从0开始编号），在搜索词中的上一次出现位置为4，所以后移6 - 4 = 2位。再以前面第二步的”S”为例，它出现在第6位，上一次出现位置是-1（即未出现），则整个搜索词后移6 - (-1) = 7位。

依然从尾部开始比较，”E”与”E”匹配。

比较前面一位，”LE”与”LE”匹配。

比较前面一位，”PLE”与”PLE”匹配。

比较前面一位，”MPLE”与”MPLE”匹配。我们把这种情况称为”好后缀”（good suffix），即所有尾部匹配的字符串。注意，”MPLE”、”PLE”、”LE”、”E”都是好后缀。

比较前一位，发现”I”与”A”不匹配。所以，”I”是”坏字符”。

根据”坏字符规则”，此时搜索词应该后移2 - （-1）= 3位。问题是，此时有没有更好的移法？

我们知道，此时存在”好后缀”。所以，可以采用”好后缀规则”：

1	后移位数 = 好后缀的位置 - 搜索词中的上一次出现位置

计算时，位置的取值以”好后缀”的最后一个字符为准。如果”好后缀”在搜索词中没有重复出现，则它的上一次出现位置为-1。
所有的”好后缀”（MPLE、PLE、LE、E）之中，只有”E”在”EXAMPLE”之中出现两次，所以后移6 - 0 = 6位。

可以看到，”坏字符规则”只能移3位，”好后缀规则”可以移6位。所以，Boyer-Moore算法的基本思想是，每次后移这两个规则之中的较大值。
更巧妙的是，这两个规则的移动位数，只与搜索词有关，与原字符串无关。因此，可以预先计算生成《坏字符规则表》和《好后缀规则表》。使用时，只要查表比较一下就可以了。

继续从尾部开始比较，”P”与”E”不匹配，因此”P”是”坏字符”。根据”坏字符规则”，后移6 - 4 = 2位。

从尾部开始逐位比较，发现全部匹配，于是搜索结束。如果还要继续查找（即找出全部匹配），则根据”好后缀规则”，后移6 - 0 = 6位，即头部的”E”移到尾部的”E”的位置。

特点

Boyer-Moore 算法的主要特点有：

对模式字符的比较顺序时从右向左。
预处理需要O(m + σ)的时间和空间复杂度。
匹配阶段需要O(m × n)的时间复杂度。
匹配阶段在最坏情况下需要3n次字符比较。
最优复杂度O(n/m)。
概念
Boyer–Moore算法在对模式P字符串进行预处理时，将采用两种不同的启发式方法。这两种启发式的预处理方法称为：

坏字符（Bad Character Heuristic）：当文本 T 中的某个字符跟模式P的某个字符不匹配时，我们称文本T中的这个失配字符为坏字符。
好后缀（Good Suffix Heuristic）：当文本 T 中的某个字符跟模式P的某个字符不匹配时，我们称文本T中的已经匹配的字符串为好后缀。

Boyer–Moore算法在预处理时，将为两种不同的启发法结果创建不同的数组，分别称为 Bad-Character-Shift（or The Occurrence Shift）和Good-Suffix-Shift（or Matching Shift）。当进行字符匹配时，如果发现模式P中的字符与文本T中的字符不匹配时，将比较两种不同预处理方法，选择最大的一个值来对模式串（搜索词）P的位置进行滑动。

坏字符

当出现一个坏字符时, BM算法向右移动模式串, 让模式串中最靠右的对应字符（对应的坏字符串）与坏字符相对，然后继续匹配。坏字符算法有两种情况。

模式串中有对应的坏字符时，让模式串中最靠右的对应字符（对应的坏字符串）与坏字符相对（PS：BM不可能走回头路，因为若是回头路，则移动距离就是负数了，肯定不是最大移动步数了），如下图。
模式串中不存在文本串的坏字符，直接右移模式串（未匹配的，示例：aba移动到c位置后）长度。

为了用代码来描述上述的两种情况，设计一个数组bmBc['k']，表示坏字符‘k’在模式串中出现的位置距离模式串末尾的最大长度，那么当遇到坏字符的时候，模式串可以移动距离为：shift(坏字符) = bmBc[T[i]]-（m-1-i)。

好后缀
有三种情况：
模式串中有子串匹配上好后缀，此时移动模式串，让该文本串和好后缀对齐即可，如果超过一个子串匹配上好后缀，则选择最靠左边的子串对齐。
模式串中没有文本串匹配上好后缀，此时需要寻找模式串的一个最长前缀，并让该前缀等于好后缀的后缀，寻找到该前缀后，让该前缀和好后缀对齐即可。

模式串中没有文本串匹配上好后缀，并且在模式串中找不到最长前缀，让该前缀等于好后缀的后缀。此时，直接移动模式串到好后缀（位置的取值以”好后缀”的最后一个字符为准）的下一个字符。

代码

C++代码

#include <stdio.h>
#include <string.h>
 
#define MAX_CHAR 256
#define SIZE 256
#define MAX(x, y) (x) > (y) ? (x) : (y)
 
void BoyerMoore(char *pattern, int m, char *text, int n);
 
int main()
{
	//char pattern[256] ={ 'b','a','b','c','d','e'};
	//char text[256]={'a','b','c','d','e','f','g','b','c','d','e'};
	char text[256] ={ 'I','a','m','h','a','p','p','y'},pattern[256]={'a','p',};
	BoyerMoore(pattern, strlen(pattern), text, strlen(text));

	/*char pattern[256]={'e','x','a','m','p','l','e',};
	int suff[SIZE];
    int i, j,m=strlen(pattern);*/


    return 0;
}


//@pattern：搜索串
//@m：搜索串长度
//计算坏字符数组
//一个数组bmBc['k']，表示坏字符‘k’在模式串中出现的位置距离模式串末尾的最大长度，那么当遇到坏字符的时候，模式串可以移动距离为： shift(坏字符) = bmBc[T[i]]-（m-1-i)
void PreBmBc(char *pattern, int m, int bmBc[])
{
    int i;
	//条件2：坏字符没有出现在搜索串中，移动整个搜索串(初始化的时候默认移动)
	//初始化bmBc数据
	for (i = 0; i < MAX_CHAR; i++){
        bmBc[i] = m;
	}

    //条件1
	//bmBc存字符:似乎只需要bmBc[i] = m - 1 - i就行了，但这样是不对的，因为i位置处的字符可能在pattern中多处出现，而我们需要的是最右边的位置，这样就需要每次循环判断了，
	//非常麻烦，性能差。这里有个小技巧，就是使用字符作为下标而不是位置数字作为下标。这样只需要遍历一遍即可，这是空间换时间的做法.
	for (i = 0; i < m - 1; i++){
		//坏字符在搜索串中需要移动的距离
        bmBc[pattern[i]] = m - 1 - i;
	} 
    printf("bmBc[]:");
    for (i = 0; i < m; i++)
    {
		printf("%d ",bmBc[pattern[i]]);
	}
	printf("\n");
}
 
//打印方法
//array：字符数组指针
//n长度
//字符数组名字
void print(int *array, int n, char *arrayName)
{
    int i;
 
    printf("%s:", arrayName);
    for (i = 0; i < n; i++)
    {
		printf("%d ", array[i]);
	}
	printf("\n");
}

//计算好后缀数组
//@pattern：模式串
//@m：模式串长度
//@suff：suffix[i] = s 表示以i为边界的字符后缀，与模式串后缀匹配的最大长度。（pattern中以i位置字符为后缀和以最后一个字符为后缀的公共后缀串的长度）
void suffix_old(char *pattern, int m, int suff[])
{
	//i：从后往前比较的标记位,好后缀最后1个字符的位置
	//j: 在搜索串中下标位置比较，表示好后缀在搜索词中上1次出现的位置
    int i, j;
	//定义最后一个字符需要移动的距离：整个搜索串。例子：最后1个字符是好后缀，且没有在搜索串中出现，最后一个字符移动距离=(m-1)-(-1)=m
    //suffix[i] = s 表示以i为边界的字符后缀，与模式串后缀匹配的最大长度。
    suff[m - 1] = m;
	//从最后一个串开始比较
    for (i = m - 2; i >= 0; i--)
    {
        j = i;
		//2个字符匹配，继续往前找字符
		//j<(m-1-i+j):相差1位
		//如果j与j+1，不相同j=i=i--，m-1-i+j也就是最后1个字符，如果最后1个字符找相同的字符
        while (j >= 0 && pattern[j] == pattern[m -1 - i + j]){
            j--;
		}
		//获取匹配的最大长度
		//i：好后缀最后1个字符的位置
		//suff[i]：i位置需要移动的距离
		//j:i位置的字符出现在搜索串中上1个的位置
        //非好后缀，suff的值是0,不做移动
		suff[i] = i - j;
    }
}

//改进计算suffix方法
void suffix(char *pattern, int m, int suff[]) 
{
    int f, g, i;
	
    suff[m - 1] = m;
	//上一次进行成功匹配的失配位置
    g = m - 1;
	//i开始匹配的位置
    for (i = m - 2; i >= 0; --i)
    {
		//f:上一个成功进行匹配的起始位置
        if (i > g && suff[i + m - 1 - f] < i - g){
			//获取匹配的最大长度
            suff[i] = suff[i + m - 1 - f];
		}
        else
        {
			//g的最大值就是i
            if (i < g){
                g = i;
			}
			//
            f = i;
			//寻找上失配的位置
            while (g >= 0 && pattern[g] == pattern[g + m - 1 - f]){
                --g;
			}
			//最大的匹配长度
            suff[i] = f - g;
        }
    }
    //print(suff, m, "suff[]");
}
 
//好后缀规则
//@pattern：搜索串
//@m：搜索串长度
//@bmGs: bmGs[i] 表示遇到好后缀时，搜索串应该移动的距离，其中i表示好后缀前面一个字符的位置
void PreBmGs(char *pattern, int m, int bmGs[])
{
    int i, j;
    int suff[SIZE];
 
    // 计算后缀数组
    suffix(pattern, m, suff);
	//suffix_old(pattern, m, suff);

    //条件3:模式串中没有子串匹配上好后缀
	//以好后缀最后1个字符位置为标准，移动距离：(m-1)-(-1)
    for (i = 0; i < m; i++){
        bmGs[i] = m;
	}
 
    //条件2：模式串有前缀字符串匹配好后缀的后缀
    //范围内的任意位置:模式串前缀匹配好后缀后1位置开始-好后缀的前一个字符串标记位
	j = 0;
    //i:从尾部往前比较
	//i：好后缀的前1个位置
	for (i = m - 1; i >= 0; i--)
    {
		//i等于其最大匹配长度,说明i位置的字符没有在模式串中重复出现
        if (i + 1 == suff[i])
        {
			//j从前往后比较
			//模式串的前缀的位置开始->模式串后缀前1个位置
            for (;j < m - 1 - i; j++)
            {
				//如果j的最大移动位置不是m，说明j的位置字符串在模式串中重复存在，j的位置就是在好后缀范围内。根据公式：后移位数 = 好后缀的位置 - 搜索词中的上一次出现位置
				//如果j的最大移动位置是m，说明j不在好后缀的范围内
                if (m == bmGs[j]){
					//j位置移动的最大距离.好后缀在模式串匹配的前缀+1位置开始移动
                    bmGs[j] = m - 1 - i;
				}
            }
        }
    }
 
    //条件1
    for (i = 0; i <= m - 2; i++){
		int k=m - 1 - suff[i];
		int p=m - 1 - i;
		//suff[i]：i位置的字符后缀与搜索串最后1个字符的后缀相同的最大长度，好后缀的最大长度
		//在位置：m - 1 - suff[i]，需要移动的距离：m-1-i
        bmGs[k] = p;
	}

    print(bmGs, m , "bmGs[]");
}
 


//@pattern：搜索词字符串
//@m：搜索词字符串长度
//@text：文本字符串
//@m：文本字符串长度
void BoyerMoore(char *pattern, int m, char *text, int n)
{
    int i, j, bmBc[MAX_CHAR], bmGs[SIZE];
	//获取坏字符
    PreBmBc(pattern, m, bmBc);
    //获取最好后缀
	PreBmGs(pattern, m, bmGs);

    // Searching
    j = 0;
    while (j <= n - m)
    {
		//文本串和模式串从尾部开始比较
		for (i = m - 1; i >= 0 && pattern[i] == text[i + j]; i--){
			printf("\n");
		};

		//找到第一个不匹配的位置，然后说明是最好后缀模式，输出模式串的移动距离
		//i=0;i在搜索串第1个字符上,i--:在搜索字符串外的第1个位置
        if (i < 0)
        {
            printf("Find it, the position is %d\n", j);
			//在搜索字符串0这个位置需要移动的距离
            j += bmGs[0];
            return ;
        }
        else{
			//坏字符规则:后移位数 = 坏字符的位置 - 搜索词中的上一次出现位置
			//选择最大的一个值来对模式串（搜索词） 的位置进行滑动.
			//坏字符串移动的距离：bmBc[text[i + j]] - （m - 1 - i）,bmBc[text[i + j]]:坏字符在搜索字符串中出现的位置到搜索字符串末端的最大距离
			//好后缀移动的距离：bmGs[i]
			//i在搜索串这个位置，搜索串需要移动的距离
            j += MAX(bmBc[text[i + j]] - m + 1 + i, bmGs[i]);
		}
	}
 
    printf("No find.\n");
}

Java代码1

package demo1;

public class BoyerMoore {
	/**
	 * 
	 * right：存储搜索串中每个字符的位置，字符用ascii表示
	 * 这里字符的位置是最新比较的位置。每一个字符位置都会覆盖之前字符的位置
	 * 坏字符串：搜索词中的上一次出现位置，实际上就是最靠右的字符的位置，因为从后->前 
	 *  
	 * @param pat
	 * @param right
	 */
	public static void getRight(String pat, int[] right) {
		for (int i = 0; i < 256; i++) {
			right[i] = -1;
		}
		for (int i = 0; i < pat.length(); i++) {
			right[pat.charAt(i)] = i;
		}
	}

	/**
	 * 只实现坏字符串，没有实现好后缀
	 * @param txt
	 * @param pat
	 * @param right
	 * @return
	 */
	public static int BoyerMooreSearch(String txt, String pat, int[] right) {
		int M = txt.length();
		int N = pat.length();
		//跳跃标记（搜索串移动距离-后移的位置）
		//此代码指：文本串标记位的移动
		int skip;
		//文本串从头开始比较
		for (int i = 0; i <= M - N; i += skip) {
			skip = 0;
			//搜索串从最后1个字符开始比较
			for (int j = N - 1; j >= 0; j--) {
				//比较最后1个字符，如果最后1个字符不匹配，就是不匹配。
				//j：搜索串中最后1个位置，从后->前
				//i+j：上1次不匹配移动距离+搜索串的长度
				if (pat.charAt(j) != txt.charAt(i + j)) {
					//坏字符串：后移位数 = 坏字符的位置（在搜索串中的位置） - 搜索词中的上一次出现位置（最新比较的位置）
					//坏字符的位置：当前在搜索串不匹配的位置.
					//right[txt.charAt(i + j)]：坏字符串在搜索串的出现位置。未出现=-1
					skip = j - right[txt.charAt(i + j)];
					if (skip < 1) {
						skip = 1;
					}
					break;
				}
			}
			if (skip == 0){
				return i;
			}
		}
		return -1;
	}

	public static void main(String[] args) {
		String txt = "BBC ABCDAB AACDABABCDABCDABDE";
		String pat = "ABCDABD";
		int[] right = new int[256];
		getRight(pat, right);
		System.out.println(BoyerMooreSearch(txt, pat, right));
	}
}

Java代码2

package demo1;

import java.util.HashMap;
import java.util.Map;

/**
 * 与C++版本类似。
 *
 */
public class BoyerMoore2 {

	public static void main(String[] args) {
		String text = "aadqaqqcdqqasdadacbcdbccbcd";
		String pattern = "adacbcd";
		BoyerMoore2 bm = new BoyerMoore2();
		bm.boyerMoore(pattern, text);
	}

	public void boyerMoore(String pattern, String text) {
		int m = pattern.length();
		int n = text.length();
		Map<String, Integer> bmBc = new HashMap<String, Integer>();
		int[] bmGs = new int[m];
		// proprocessing
		preBmBc(pattern, m, bmBc);
		preBmGs(pattern, m, bmGs);
		// searching
		int j = 0;
		int i = 0;
		int count = 0;
		while (j <= n - m) {
			for (i = m - 1; i >= 0 && pattern.charAt(i) == text.charAt(i + j); i--) { // 用于计数
				count++;
			}
			if (i < 0) {
				System.out.println("one position is:" + j);
				j += bmGs[0];
			} else {
				j += Math.max(bmGs[i], getBmBc(String.valueOf(text.charAt(i + j)), bmBc, m) - m + 1 + i);
			}
		}
		System.out.println("count:" + count);
	}

	private void preBmBc(String pattern, int patLength, Map<String, Integer> bmBc) {
		System.out.println("bmbc start process...");
		{
			for (int i = patLength - 2; i >= 0; i--)
				if (!bmBc.containsKey(String.valueOf(pattern.charAt(i)))) {
					bmBc.put(String.valueOf(pattern.charAt(i)), (Integer) (patLength - i - 1));
				}
		}
	}

	private void preBmGs(String pattern, int patLength, int[] bmGs) {
		int i, j;
		int[] suffix = new int[patLength];
		suffix(pattern, patLength, suffix);
		// 模式串中没有子串匹配上好后缀，也找不到一个最大前缀
		for (i = 0; i < patLength; i++) {
			bmGs[i] = patLength;
		}
		// 模式串中没有子串匹配上好后缀，但找到一个最大前缀
		j = 0;
		for (i = patLength - 1; i >= 0; i--) {
			if (suffix[i] == i + 1) {
				for (; j < patLength - 1 - i; j++) {
					if (bmGs[j] == patLength) {
						bmGs[j] = patLength - 1 - i;
					}
				}
			}
		}
		// 模式串中有子串匹配上好后缀
		//aacbcd
		for (i = 0; i < patLength - 1; i++) {
			bmGs[patLength - 1 - suffix[i]] = patLength - 1 - i;
		}
		System.out.print("bmGs:");
		for (i = 0; i < patLength; i++) {
			System.out.print(bmGs[i] + ",");
		}
		System.out.println();
	}

	private void suffix(String pattern, int patLength, int[] suffix) {
		suffix[patLength - 1] = patLength;
		int q = 0;
		for (int i = patLength - 2; i >= 0; i--) {
			q = i;
			while (q >= 0 && pattern.charAt(q) == pattern.charAt(patLength - 1 - i + q)) {
				q--;
			}
			suffix[i] = i - q;
		}
		System.out.println();
	}

	private int getBmBc(String c, Map<String, Integer> bmBc, int m) {
		// 如果在规则中则返回相应的值，否则返回pattern的长度
		if (bmBc.containsKey(c)) {
			return bmBc.get(c);
		} else {
			return m;
		}
	}

}

Java代码3

package demo1;

/**
 * 好后缀方法没有看懂
 * @author Administrator
 *
 */
public class BoyerMoore3 {

	public static final int ALPHABET_SIZE = Character.MAX_VALUE + 1;
	//文本串
	private String text;
	//搜索串
	private String pattern;

	private int[] last;
	//搜索串-好后缀：每个字符移动的距离
	private int[] match;
	private int[] suffix;

	public BoyerMoore3(String pattern, String text) {
		this.text = text;
		this.pattern = pattern;
		last = new int[ALPHABET_SIZE];
		match = new int[pattern.length()];
		suffix = new int[pattern.length()];
	}

	/**
	 * Searches the pattern in the text. Returns the position of the first
	 * occurrence, if found and -1 otherwise.
	 */
	public int match() {
		// Preprocessing
		computeLast();
		computeMatch();

		// Searching
		int i = pattern.length() - 1;
		int j = pattern.length() - 1;
		while (i < text.length()) {
			if (pattern.charAt(j) == text.charAt(i)) {
				if (j == 0) {
					// the left-most match is found
					return i;
				}
				j--;
				i--;
			} else { // a difference
				//i：文本串开始比较的位置
				//Math.max(j - last[text.charAt(i)], match[j])：坏字符串，好后缀的最大值
				//pattern.length()：从文本串+搜索串最后1个位置开始
				i =i+ pattern.length() - j - 1 + Math.max(j - last[text.charAt(i)], match[j]);
				//搜索串不相同，从尾端重新开始
				j = pattern.length() - 1;
			}
		}
		return -1;
	}

	/**
	 * Computes the function
	 * <i>last</i> and stores its values in the array
	 * <code>last</code>. The function is defined as follows:
	 * 
	 * <pre>
	 * last(Char ch) = the index of the right-most occurrence of the character ch in the pattern;  
	 *                 -1 if ch does not occur in the pattern.
	 * </pre>
	 * 
	 * The running time is O(pattern.length() + |Alphabet|).
	 * 
	 * 存储搜索串中每个字符的位置，字符用ascii表示
	 * 这里字符的位置是最新比较的位置。每一个字符位置都会覆盖之前字符的位置
	 * 坏字符串：搜索词中的上一次出现位置，实际上就是最靠右的字符的位置，因为从后->前 
	 * 
	 */
	private void computeLast() {
		for (int k = 0; k < last.length; k++) {
			last[k] = -1;
		}
		for (int j = pattern.length() - 1; j >= 0; j--) {
			if (last[pattern.charAt(j)] < 0) {
				last[pattern.charAt(j)] = j;
			}
		}
	}

	/**
	 * Computes the function 
	 * <i>match</i> and stores its values in the array
	 * <code>match</code>. The function is defined as follows:
	 * 
	 * <pre>
	 * match(j) = min{ s | 0 < s <= j && p[j-s]!=p[j] 
	 *                            && p[j-s+1]..p[m-s-1] is suffix of p[j+1]..p[m-1] },  
	 *                                                         if such s exists, else 
	 *            min{ s | j+1 <= s <= m  
	 *                            && p[0]..p[m-s-1] is suffix of p[j+1]..p[m-1] },  
	 *                                                         if such s exists, 
	 *            m, otherwise, 
	 * where m is the pattern's length and p is the pattern.
	 * </pre>
	 * 
	 * The running time is O(pattern.length()).
	 */
	private void computeMatch() {
		/* Phase 1 */
		//条件3
		for (int j = 0; j < match.length; j++) {
			match[j] = match.length;
		} // O(m)

		computeSuffix(); // O(m)

		//条件1
		/* Phase 2 */
		// Uses an auxiliary array, backwards version of the KMP failure function.
		// suffix[i] = the smallest j > i s.t. p[j..m-1] is a prefix of p[i..m-1],    j>i，好后缀在搜索串有重复的好后缀匹配
		// if there is no such j, suffix[i] = m
		// Compute the smallest shift s, such that 0 < s <= j and
		// p[j-s]!=p[j] and p[j-s+1..m-s-1] is suffix of p[j+1..m-1] or j == m-1}, if such s exists,
		for (int i = 0; i < match.length - 1; i++) {
			// suffix[i+1] <= suffix[i] + 1
			int j = suffix[i + 1] - 1; 
			// therefore pattern[i] != pattern[j]
			if (suffix[i] > j) {
				match[j] = j - i;
			} else {
				// j == suffix[i]
				match[j] = Math.min(j - i + match[i], match[j]);
			}
		} // End of Phase 2

		//条件2
		/* Phase 3 */
		// Uses the suffix array to compute each shift s such that
		// p[0..m-s-1] is a suffix of p[j+1..m-1] with j < s < m   //搜索串中的最大前缀和好后缀匹配
		// and stores the minimum of this shift and the previously computed one.
		if (suffix[0] < pattern.length()) {
			for (int j = suffix[0] - 1; j >= 0; j--) {
				if (suffix[0] < match[j]) {
					match[j] = suffix[0];
				}
			}
			int j = suffix[0];
			for (int k = suffix[j]; k < pattern.length(); k = suffix[k]) {
				while (j < k) {
					if (match[j] > k){
						match[j] = k;
					}
					j++;
				}
			}
		}// endif
	}

	/**
	 * Computes the values of <code>suffix</code>, 
	 * which is an auxiliary array, backwards version of the KMP failure function. <br>
	 * suffix[i] = the smallest j > i s.t. （j>i的最小值，）
	 * p[j..m-1] is a prefix of p[i..m-1]（后缀匹配）, if there is no such j, suffix[i] = m, i.e. <br>
	 * p[suffix[i]..m-1] is the longest prefix of p[i..m-1]（）, if suffix[i] < m. <br>
	 *  The running time for computing the <code>suffix</code> is O(m).
	 *  
	 *  pattern中以i位置字符为后缀和以最后一个字符为后缀的公共后缀串的长度
	 */
	private void computeSuffix() {
		//最后1个字符，在搜索串中的后缀匹配长度：m
		suffix[suffix.length - 1] = suffix.length;
		int j = suffix.length - 1;
		// suffix[i] = m - the length of the longest prefix of p[i..m-1]
		for (int i = suffix.length - 2; i >= 0; i--) {
			//i：前1个字符
			//j：后1个字符
			//如果不匹配，说明找到最大好后缀
			while (j < suffix.length - 1 && pattern.charAt(j) != pattern.charAt(i)) {
				//
				j = suffix[j + 1] - 1;
			}
			//如果字符匹配，继续前1个字符比较
			if (pattern.charAt(j) == pattern.charAt(i)) {
				j--;
			}
			//i位置好后缀在搜索串中后缀匹配的最大长度
			suffix[i] = j + 1;
		}

	}

	public static void main(String[] args) {
//		String text = "here is a simple example  sadasdw";
//		String pattern = "example";
		String text = "aadqaqqcdqqasdaacbcdbccbcd";
		String pattern = "aacbcd";
		System.out.println(new BoyerMoore3(pattern, text).match());
	}

}

代码解释

坏字符

定义一个bmBc数组：一个数组bmBc['k']，表示坏字符’k’在模式串中出现的位置距离模式串末尾的最大长度。
这个计算应该很容易，似乎只需要bmBc[i] = m - 1 - i就行了，但这样是不对的，因为i位置处的字符可能在模式串中多处出现（如下图所示），而我们需要的是最右边的位置，这样就需要每次循环判断了，非常麻烦，性能差。这里有个小技巧，就是使用字符作为下标而不是位置数字作为下标。这样只需要遍历一遍即可，这是空间换时间的做法。

条件1：字符在模式串中有出现，bmBc['v']表示字符v在模式串中最后一次出现的位置（期间有个循环），距离模式串串尾的长度，如上图所示。

for (i = 0; i < m - 1; i++){
        bmBc[pattern[i]] = m - 1 - i;   //bmBc['k']，表示坏字符‘k’在模式串中出现的位置距离模式串末尾的最大长度；pattern[i]:模式串单个字符；
                                        //表示:第i个字符，是坏字符，距离模式串末尾长度
}

条件2：坏字符在模式串中没有出现，如模式串中没有字符v，则BmBc['v'] = strlen(pattern)。

1 2	for (i = 0; i < MAX_CHAR; i++) bmBc[i] = m; //没有一个匹配，移动整个模式串

模式串实际要右移的距离就是：bmBc['v'] - m + 1 + i。

好后缀

suffix的方法：定义一个suffix数组，该数组的意思是suffix[i] = s表示以i为边界的字符后缀，与模式串后缀匹配的最大长度。（pattern中以i位置字符为后缀和以最后一个字符为后缀的公共后缀串的长度）
定义bmGs[]数组:bmGs[i]表示遇到好后缀时，模式串应该移动的距离，其中i表示好后缀前面一个字符的位置。
条件1：模式串中有子串匹配上好后缀

或者

//条件1
 for (i = 0; i <= m - 2; i++){
        //在位置：m - 1 - suff[i]，需要移动的距离：m-1-i
        bmGs[m - 1 - suff[i]] = m - 1 - i;     //从前往后搜索匹配的模式串，i为好后缀边界的字符
 }

条件2：模式串中没有子串匹配上好后缀，但找到一个最大前缀

或者

   //条件2：模式串有前缀字符串匹配
   //好后缀的前一个字符串标记位
j = 0;
   //i:从尾部往前比较
for (i = m - 1; i >= 0; i--)
   {
	//i等于其最大匹配长度，说明：i是在模式串的匹配位置上
       //
       if (i + 1 == suff[i])
       {
		//j从前往后比较，从空白部分（如图）开始++
           for (;j < m - 1 - i; j++)
           {
			//
               if (m == bmGs[j]){
                   bmGs[j] = m - 1 - i;//假设j是在de后面的位置，也就是位置2，如果在位置2，de需要往后移动18，才能和de相匹配。所以则由：m-1-i
			}
           }
       }
   }

条件3：模式串中没有子串匹配上好后缀

或者

//条件3,每一个字符都是移动m的距离
for (i = 0; i < m; i++){
        bmGs[i] = m;            //每个字符的最好后缀都是移动整个模式串
}

改进suff方法

i：是当前正准备计算suff[]值的那个位置。（开始匹配的位置）
f：是上一个成功进行匹配的起始位置。（不是每个位置都能进行成功匹配的，实际上能够进行成功匹配的位置并不多）
g：是上一次进行成功匹配的失配位置。
如果i在g和f之间，那么一定有P[i]=P[m-1-f+i]（改进在此处）；并且如果suff[m-1-f+i] < i-g, 则suff[i] = suff[m-1-f+i]，这不就利用了前面的suff了吗。
PS：这里有些人可能觉得应该是suff[m-1-f+i] <= i – g，因为若suff[m-1-f+i] = i – g，还是没超过suff[f]的范围，依然可以利用前面的suff[]，但这是错误的，比如一个极端的例子。

1 2	i ：0 1 2 3 4 5 6 7 8 9 pattern：a a a a a b a a a a

suff[4] = 4，这里f=4,g=0，当i=3是，这时suff[m-1=f+i]=suff[8]=3，而suff[3]=4，两者不相等，因为上一次的失配位置g可能会在这次得到匹配。
suff[]与模式串后缀匹配的最大长度。
g<i<f,P[i]=P[m-1-f+i]，这是相等的。比如：以最后一个字符开始，f=m-1。如果是以前半段那个字符开始。m-1-f截取中间一半，然后+i，则到下半部分，下半部分的i和上半部分的i是同一个字符。
suff[m-1-f+i] < i-g，相当于i在偏f方向。

suff数组

在计算bmGc数组时，为提高效率，先计算辅助数组suff[]表示好后缀的长度。
suff数组的定义：m是pattern的长度。

1
2
3

a. suffix[m-1] = m;
b. suffix[i] = k
for [ pattern[i-k+1] ….,pattern[i]] == [pattern[m-1-k+1]，pattern[m-1]]

看上去有些晦涩难懂，实际上suff[i]就是求：pattern中以i位置字符为后缀和以最后一个字符为后缀的公共后缀串的长度。不知道这样说清楚了没有，还是举个例子吧。

i     : 0 1 2 3 4 5 6 7
pattern: b c  a b a b a b

当i=7时，按定义suff[7] = strlen(pattern) = 8。
当i=6时，以pattern[6]为后缀的后缀串为bcababa，以最后一个字符b为后缀的后缀串为bcababab，两者没有公共后缀串，所以suff[6] = 0。
当i=5时，以pattern[5]为后缀的后缀串为bcabab，以最后一个字符b为后缀的后缀串为bcababab，两者的公共后缀串为abab，所以suff[5] = 4。
以此类推……
当i=0时，以pattern[0]为后缀的后缀串为b，以最后一个字符b为后缀的后缀串为bcababab，两者的公共后缀串为b，所以suff[0] = 1。

# 分词算法, 算法