Description

Compute the longest common subsequence of two sequences, and generate a script to transform one sequence into the other.

Author

Ivan Raikov

Version

Requires

Usage

(require-extension npdiff)

Download

npdiff.egg

Documentation

The npdiff library is an implementation of an algorithm for sequence comparison that has a worst-case running time of O(N*P), where P is the number of deletions required by the shortest edit script between the two sequences. The algorithm is described in the following paper:

Sun Wu, Udi Manber, Eugene Myers, and Webb Miller. An O(NP) sequence comparison algorithm. In Information Processing Letters, volume 35, pages 317--323, September 1990.

This library defines a data type that describes the three basic operations of an edit script (remove, insert, change) and a procedure that implements the algorithm for sequences of an arbitrary type, provided an equality predicate for that type, and accessors for the sequence.

Data types

datatype: (define-datatype diffop diffop? ...)

Representation of the three diff operations: insert, remove, change. The target in each operation is an index or a range of indices in the sequence being transformed (target sequence), and the source is a range of indices in the sequence being read from (source sequence). The operation definitions are:

(Insert target source seq context)

Insert operation;

  • TARGET is the index at which the insertion takes place;
  • SOURCE is a range in the form (x . y) that specifies the indices of those elements in the source sequence that are inserted in the target sequence;
  • SEQ is the target sequence, and
  • CONTEXT is an optional context provided in the form (BEFORE . AFTER) where BEFORE is a sequence of elements from the source preceding the elements being inserted, and AFTER is a sequence of elements from the source that follow the elements being inserted.

(Remove target seq context)

Remove operation;

  • TARGET is a range of elements in the form (x . y) that are being deleted from the target sequence;
  • SEQ is the target sequence, and
  • CONTEXT is an optional context provided in the form (BEFORE . AFTER) where BEFORE is a sequence of elements from the target preceding the elements being deleted, and AFTER is a sequence of elements from the target that follow the elements being deleted.

(Change target source seqin seqout contextin contextout)

Change operation;

  • TARGET is a range of elements in the form (x . y) that are being modified in the target sequence;
  • SOURCE is a range of elements in the form (x . y) from the source sequence that are replacing the specified elements in the target sequence;
  • SEQIN is the source sequence, and SEQOUT is the target sequence;
  • CONTEXTOUT is an optional context provided in the form (BEFORE . AFTER) where BEFORE is a sequence of elements from the target preceding the elements being changed (SEQOUT), and AFTER is a sequence of element from the target that follow the elements being changed (SEQOUT); CONTEXTIN is an optional context provided in the form (BEFORE . AFTER) where BEFORE is a sequence of elements from the source preceding the replacement elements (SEQIN), and AFTER is a sequence of elements from the source that follow the replacement elements (SEQIN);

Procedures

procedure: make-diff:: EQUAL? SEQ-REF SEQ-LENGTH HUNKS-PROC -> DIFF-PROC

create a diff procedure, where

  • EQUAL? is an equality predicate for elements of the type being compared;
  • SEQ-REF is a procedure of the form LAMBDA SEQ INDEX -> ELEMENT which returns element INDEX from sequence SEQ;
  • SEQ-LENGTH is a procedure of the form LAMBDA SEQ -> LEN which returns the length of sequence SEQ;
  • HUNKS is a procedure of the type created by make-hunks, which is described below.
The returned procedure is of the form LAMBDA A B [CONTEXT-LEN] -> (HUNK ... ) where A and B are two sequences to be compared, and CONTEXT-LEN is an optional context length, which is used by the HUNKS procedure to create context in the hunks of the edit script

procedure: make-hunks:: SEQ-REF SEQ-LENGTH SEQ-SLICE -> HUNKS-PROC

create a procedure that creates insert/remove/change hunks given a stack of matching index ranges that is produced by the diff procedure above.

  • SEQ-REF is a procedure of the form LAMBDA SEQ INDEX -> ELEMENT which returns element INDEX from sequence SEQ;
  • SEQ-LENGTH is a procedure of the form LAMBDA SEQ -> LEN which returns the length of sequence SEQ;
  • SEQ-SLICE is a procedure of the form LAMBDA SEQ START END -> SEQ which returns a list with the elements contained within the specified range in the input sequence SEQ;
The returned procedure is of the form LAMBDA A B CSS [CONTEXT] -> (HUNK ... ) where A and B are the two sequences compared, CSS is the stack of matching ranges returned by diff, and CONTEXT is an optional argument that is used to add context to the generated hunks, if specified.

Examples

;; Create a function to find the differences between two vectors
;; containing strings
(require-extension npdiff)
(require-extension vector-lib)

(define  (textdiff text1 text2 . rest)
  (let-optionals  rest ((context-len 0))
    (let ((A   (list->vector text1))
	  (B   (list->vector text2)))
      ((make-npdiff  string=? vector-ref vector-length 
		     (make-hunks vector-ref vector-length 
				 (compose vector->list vector-copy)))
       A B context-len))))

License

Copyright 2007 Ivan Raikov. 

This program is free software: you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation, either version 3 of the License, or
(at your option) any later version.

This program is distributed in the hope that it will be useful, but
WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
General Public License for more details.

A full copy of the GPL license can be found at
<http://www.gnu.org/licenses/>.