BoyerMoore [博耶-穆尔字符串搜索算法、BM 算法]

# 介绍

在计算机科学里，博耶 - 穆尔字符串搜索算法是一种非常高效的字符串搜索算法。虽然博耶 - 穆尔算法的执行时间同样线性依赖于被搜索字符串的大小，但是通常仅为其它算法的一小部分：它不需要对被搜索的字符串中的字符进行逐一比较，而会跳过其中某些部分。通常搜索关键字越长，算法速度越快。它的效率来自于这样的事实：对于每一次失败的匹配尝试，算法都能够使用这些信息来排除尽可能多的无法匹配的位置。

更巧妙的是，这两个规则的移动位数，只与搜索词有关，与原字符串无关。因此，可以预先计算生成《坏字符规则表》和《好后缀规则表》。使用时，只要查表比较一下就可以了。

# 移动规则

移动字符数是通过两条规则决定的：坏字符规则和好后缀规则。实际移动为通过这两条规则计算出的最大移动个数。

坏字符规则：当 T 中有字符不匹配时，如果 T 中的这个不匹配的字符出现在对应 P 中当前位置的左侧，那么 P 移动位置将这两个在字符对齐。如果 T 中这个不匹配字符不在 P 中当前位置的左侧，那么将当前位置左侧的所有字符均移到该不匹配字符后。
好后缀规则：假设有 P 和 T，T 中字串 t 匹配到了 P 的一个后缀，但在比较位置 i 时发生不匹配。设匹配到的好后缀在 T 中为 t，在 P 中为 t'（t = t'）。好后缀规则的实质是，将不匹配位置右侧匹配到的字符串 t 的所有字符后缀组合，用于查找它们在 P 的不匹配位置左侧是否存在。分两种情况来讨论：
- 在 P 中 i 位置的左侧最靠近 i 位置查找字串 t' 使得 t'=t（此时 t' 不是 P 的后缀，实际上也就是查找匹配到的字串除了在 P 的后缀中存在，是否在 P 的其他位置存在），若存在，则移动相应的位数将找到的 t' 与 T 中的 t 对齐。
- 如果 t' 不存在，那我们继续查找 t 的某一个后缀是否为 P 的前缀，若存在，则移动相应的位将 P 的前缀与 t 的后缀位置对齐。否则，将 P 向后移动 n 个字符。

# 演示

可视化

# 实现

# JavaScript

只应用坏字符规则。

/*
 *Implementation of the Boyer-Moore String Search Algorithm.
 *The Boyer–Moore string search algorithm allows linear time in
 *search by skipping indices when searching inside a string for a pattern.
 **/
const buildBadMatchTable = (str) => {
  const tableObj = {}
  const strLength = str.length
  for (let i = 0; i < strLength - 1; i++) {
    tableObj[str[i]] = strLength - 1 - i
  }
  if (tableObj[str[strLength - 1]] === undefined) {
    tableObj[str[strLength - 1]] = strLength
  }
  return tableObj
}

const boyerMoore = (str, pattern) => {
  const badMatchTable = buildBadMatchTable(pattern)
  let offset = 0
  const patternLastIndex = pattern.length - 1
  const maxOffset = str.length - pattern.length
  // if the offset is bigger than maxOffset, cannot be found
  while (offset <= maxOffset) {
    let scanIndex = 0
    while (pattern[scanIndex] === str[scanIndex + offset]) {
      if (scanIndex === patternLastIndex) {
        // found at this index
        return offset
      }
      scanIndex++
    }
    const badMatchString = str[offset + patternLastIndex]
    if (badMatchTable[badMatchString]) {
      // increase the offset if it exists
      offset += badMatchTable[badMatchString]
    } else {
      offset++
    }
  }
  return -1
}

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42

更复杂的实现，同时应用坏字符规则和好后缀规则。

/**
 * Returns the index of the first occurence of given string in the phrase
 * In case of no match, returns -1
 * see https://gist.github.com/Kamilczak020/f8382eef9777e8f07d47be29a4efc04b
 * 
 * @param text string to be searched 
 * @param pattern string to be found in the text
 */
function boyerMooreSearch(text: string, pattern: string): number {
    
    // Handle edge case
    if (pattern.length === 0) {
        return -1;
    }

    let charTable = makeCharTable(pattern);
    let offsetTable = makeOffsetTable(pattern);

    for (let i = pattern.length - 1, j; i < text.length;) {
        for (j = pattern.length - 1; pattern[j] == text[i]; i--, j--) {
            if (j === 0) {
                return i;
            }
        }

        const charCode = text.charCodeAt(i);
        i+= Math.max(offsetTable[pattern.length - 1 - j], charTable[charCode]);
    }

    return -1;
}

/**
 * Creates jump table, based on mismatched character information
 */
function makeCharTable(pattern: string): number[] {
    let table = [];

    // 65536 being the max value of char + 1
    for (let i = 0; i < 65536; i++) {
        table.push(pattern.length);
    }

    for (let i = 0; i < pattern.length - 1; i++) {
        const charCode = pattern.charCodeAt(i);
        table[charCode] = pattern.length - 1 - i;
    }

    return table;
}


function makeOffsetTable(pattern: string): number[] {
    let table = [];
    table.length = pattern.length;

    let lastPrefixPosition = pattern.length;

    for (let i = pattern.length; i > 0; i--) {
        if (isPrefix(pattern, i)) {
            lastPrefixPosition = i;
        }

        table[pattern.length - i] = lastPrefixPosition - 1 + pattern.length;
    }

    for (let i = 0; i < pattern.length - 1; i++) {
        const slen = suffixLength(pattern, i);
        table[slen] = pattern.length - 1 - i + slen;
    }

    return table;
}

function isPrefix(pattern: string, p: number): boolean {
    for (let i = p, j = 0; i < pattern.length; i++, j++) {
        if (pattern[i] != pattern[j]) {
            return false;
        }

        return true;
    }
}

function suffixLength(pattern: string, p: number) {
    let len = 0;

    for (let i = p, j = pattern.length - 1; i >= 0 && pattern[i] == pattern[j]; i--, j--) {
        len += 1;
    }

    return len;
}

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93

# 扩展

# 如何计算坏字符表和好后缀表？

参见：

# 参考

编辑

#[ "ReadAgain" ]

上次更新: 2022/10/20, 20:03:22

← AlternativeStringArrange [交替合并字符串] CheckAnagram [易位构词]→