Skip to content

tulz-app/stringdiff

Repository files navigation

Maven Central

app.tulz.diff

Myers diff algorithm in Scala.

"app.tulz" %%% "stringdiff" % "0.3.4" 

Overview

The core algorithm is a Scala translation of the Python implementation described here: https://blog.robertelder.org/diff-algorithm/

Additionally, the following is implemented on top of it:

  • interpretation of the algorithm's output
  • customizable formatting of the interpreted result into a string (with a few provided out of the box formatters)
  • additional transformations for when the input is expected to be a sequence of tokens (TokenDiff)

Interpretation

The core algorithm outputs a set of instructions to get from SeqA to SeqB, something like this:

in order to get from s1="bcdefgzio" to s2="abcxyfgi":

Insert a from s2 before position 0 into s1.
Delete d from s1 at position 2 in s1.
Insert x from s2 before position 3 into s1.
Insert y from s2 before position 3 into s1.
Delete e from s1 at position 3 in s1.
Delete z from s1 at position 6 in s1.
Delete o from s1 at position 8 in s1.

It's hard to work with it. Also, it is somewhat tricky at first to understand how to follow these instructions (try it! :) ).

MyersInterpret object parses the raw output into a List[DiffElement]:

trait DiffElement[Repr]

object DiffElement {
  final case class InBoth[Repr](v: Repr)        extends DiffElement[Repr]
  final case class InFirst[Repr](v: Repr)       extends DiffElement[Repr]
  final case class InSecond[Repr](v: Repr)      extends DiffElement[Repr]
  final case class Diff[Repr](x: Repr, y: Repr) extends DiffElement[Repr]
}

The result of the interpretation (without any transformations applied) for the same example:

StringDiff
  .raw(
    "bcdefgzio",
    "abcxyfgi",
    collapse = false
  ).mkString("[\n  ", "\n  ", "\n]")
[
  InSecond(a)
  InBoth(bc)
  InFirst(d)
  InSecond(x)
  InSecond(y)
  InFirst(e)
  InBoth(fg)
  InFirst(z)
  InBoth(i)
  InFirst(o)
]

Collapsing

By default, diff functions will collapse the diff:

StringDiff
  .raw(
    "bcdefgzio",
    "abcxyfgi",
    collapse = true // default is true
  ).mkString("[\n  ", "\n  ", "\n]")
[
  InSecond(a)
  InBoth(bc)
  Diff(de,xy)
  InBoth(fg)
  InFirst(z)
  InBoth(i)
  InFirst(o)
]

Here, the following list of DiffElements:

[
  InFirst(d)
  InSecond(x)
  InSecond(y)
  InFirst(e)
]

got collapsed into a single one:

[
  Diff(de,xy)
]

In a nutshell, collapsing removes empty elements and joins same or otherwise "join-able" subsequent DiffElements.

Examples:

  • any InFirst, InLast, Diff or InBoth gets removed if the element is empty
  • Diff becomes InFirst or InSecond if one the elements is empty
  • InFirst(a)+InFirst(b) -> InFirst(ab)
  • InFirst(a)+InSecond(b) -> Diff(a,b)
  • InSecond(a)+InDiff(b,c) -> Diff(ab,c)
  • etc

Diff'ing sequences:

SeqDiff
  .seq(
    Seq(1, 2, 3, 4, 5).toIndexedSeq,
    Seq(1, 2, 8, 3, 8, 4, 5, 0).toIndexedSeq
  ).mkString("[\n  ", "\n  ", "\n]")
[
  InBoth(Vector(1, 2))
  InSecond(Vector(8))
  InBoth(Vector(3))
  InSecond(Vector(8))
  InBoth(Vector(4, 5))
  InSecond(Vector(0))
]

Diff'ing strings:

Raw diff AST:
  println(
    StringDiff.diff(
      "bcdefgzio",
      "abcxyfgi"
    )
  )
[
  InSecond(a)
  InBoth(bc)
  Diff(de,xy)
  InBoth(fg)
  InFirst(z)
  InBoth(i)
  InFirst(o)
]  
Text output:
  println(
    StringDiff.text(
      "bcdefgzio",
      "abcxyfgi"
    )
  )
[∅|a]]bc][de|xy]]fg][z|∅]]i][o|∅]
ANSI color output:
  println(
    StringDiff.ansi(
      "bcdefgzio",
      "abcxyfgi"
    )
  )

screenshot1

(default formatters highlight missing text with yellow, extraneous text — with red, and matching text is underlined)

Diff'ing tokens

When the inputs are strings that are expected to contain whitespace-separated tokens, TokenDiff will try to make the diff more comprehensible in terms of tokens (while preserving the accuracy).

  println(
    TokenDiff.ansi(
      "match-1 match-2 diff-1 diff-2 match-3 match-4 diff-1 diff-2 match-1 match-2 diff-1 match-3 match-4 diff-1 match-1 match-2 diff-1 match-3 match-4 suffix-1",
      "prefix-1 match-1 match-2 diff-3 match-3 match-4 match-1 match-2 diff-2 diff-3 match-3 match-4 diff-2 match-1 match-2 match-3 match-4"
    )    
  )

screenshot3

With a StringDiff the output would look like the following:

  println(
    StringDiff.ansi(
      "match-1 match-2 diff-1 diff-2 match-3 match-4 diff-1 diff-2 match-1 match-2 diff-1 match-3 match-4 diff-1 match-1 match-2 diff-1 match-3 match-4 suffix-1",
      "prefix-1 match-1 match-2 diff-3 match-3 match-4 match-1 match-2 diff-2 diff-3 match-3 match-4 diff-2 match-1 match-2 match-3 match-4"
    )    
  )

screenshot4

Inline diffs for both strings
  println(
    TokenDiff.ansiBoth(
      "match-1 match-2 diff-1 diff-2 match-3 match-4 diff-1 diff-2 match-1 match-2 diff-1 match-3 match-4 diff-1 match-1 match-2 diff-1 match-3 match-4 suffix-1",
      "prefix-1 match-1 match-2 diff-3 match-3 match-4 match-1 match-2 diff-2 diff-3 match-3 match-4 diff-2 match-1 match-2 match-3 match-4"
    )    
  )

screenshot3

Usage

Diff'ing Seqs:
SeqDiff.seq(
  Seq(1, 2, 3),
  Seq(2, 3, 4)
)
Diff'ing Stringss:
StringDiff.ansi("abc", "acb")
// OR
StringDiff("abc", "acb")

StringDiff.ansiBoth("abc", "acb")
StringDiff.text("abc", "acb")
StringDiff.diff("abc", "acb")
StringDiff.raw("abc", "acb")
Diff'ing Stringss with tokens:
TokenDiff.ansi("abc", "acb")
// OR
TokenDiff("abc", "acb")

TokenDiff.ansiBoth("abc", "acb")
TokenDiff.text("abc", "acb")
TokenDiff.diff("abc", "acb")
TokenDiff.raw("abc", "acb")

Custom formatters

Formatters are instances of the DiffFormat[Out] trait:

trait DiffFormat[Out] {

  def apply(diff: List[DiffElement[String]]): Out

}
object MyFormat extends DiffFormat[MyDiffOutput] { ... }

val diff: MyDiffOutput = MyFormat(StringDiff("abc", "acb"))

For example, the TextDiffFormat is implemented like this:

object TextDiffFormat extends DiffFormat[String] {

  import DiffElement._

  def apply(diff: List[DiffElement[String]]): String = {
    val sb = new StringBuilder
    diff.foreach {
      case InBoth(both) =>
        sb.append("]")
        sb.appendAll(both)
        sb.append("]")
      case InSecond(second) =>
        sb.append("[∅|")
        sb.appendAll(second)
        sb.append("]")
      case InFirst(first) =>
        sb.append("[")
        sb.appendAll(first)
        sb.append("|∅]")
      case Diff(first, second) =>
        sb.append("[")
        sb.appendAll(first)
        sb.append("|")
        sb.appendAll(second)
        sb.append("]")
      case _ =>
    }
    sb.toString()
  }

Author

Iurii Malchenko – @yurique

License

stringdiff is provided under the MIT license.