CBDEDUPE 0.2 32-bit version Copyright Rob Weir, 1996

CompuServe: 71165,2722
INTERNET: rweir@cybercom.net

This program is free for personal use.

=======================================================================
WARNING: This program produces modified ChessBase data files, something
quite difficult, and quite undocumented.  This program seems to work for
me, but don't you think it would be better if you made a backup of your
BIG ChessBase database before using me?!
=======================================================================
Files you now have:

CBDEDUPE.TXT  the file you are reading
CBDEDUPE.EXE  the CBDEDUPE program
CBDEDUPE.CFG  the weights used for game scoring

The program CBDEDUPE takes a ChessBase data file and searches for games
which are duplicates, marking these duplicates as deleted.  By "mark as
deleted" I mean that CBDEDUPE does not actually remove the game from the
CB datafile, but instead sets the delete flag on the game, which makes the
game appear grayed-out in ChessBase.  ChessBase calls this a "virtual
deletion".  The game can be "physically deleted" using the ChessBase utility
CBFRESH, the built-in option in CBWIN, or my CBSTRIP Utility.

=======================================================================

New in version 0.2 32-bit

This 32-bit version should be functionally identical to the
previous 16-bit release.  What I have done is rewrite much of the file
access and sorting routines to take advantage of the capabilities of WIN32,
using virtual memory, memory-mapped files, etc.  Testing shows that this
has resulted in roughly a 10x speed improvement over the 16-bit version.

Also, I've added a "-p" ("practice mode") flag.  If you run like this:

CBDEDUPE radical mainbase.cbf -p

then CBDEDUPE will not delete the games from the database, but will instead
give a list of what games it would have deleted.

=======================================================================

ChessBase has the Nunn Utilities and CBWIN, both of which allow the user
to remove duplicate games, so why would you want something else?

I'm glad you asked <g>!

1) CBDEDUPE is faster than the Nunn Utilities or CBWIN in finding duplicates.

2) CBDEDUPE finds more duplicates than the Nunn Utilities or CBWIN.

3) CBDEDUPE is free.


To be fair to ChessBase, their products have the following advantages:

1) They have a better user interface

2) With CBWIN, duplicate removal is integrated with the product

3) They have the support and reliability which only commercial
products can offer.


All in all, each program has advantages.  I've seen duplicate games which
the Nunn Utilities misses and CBDEDUPE finds, and I've seen it the other
way around.  If you want, use both.  It never hurts to have several tools
in your collection!

==========================================================================

CBDEDUPE is easy to run.  You just pass in a "Search Level" option and the
name of a ChessBase file as arguments and let it run.

For example:

CBDEDUPE RADICAL C:\CB\DATA\DUTCH.CBF

The Search Level option lets you choose how close games have to be in order
for CBDEDUPE to consider them to be duplicates.  There are three levels:

"CONSERVATIVE" in which two games are duplicates if the moves, comments,
variations and year of the games are identical.  The game with the lower
"Score" is deleted.  The Score of a game is based on several factors,
including the presence or absence of a valid year, Elo rating, length of
comments and variations.  The idea is that if we have two games which are
otherwise identical, we should keep the game which has the more information. 
There is a text file, CBDEDUPE.CFG, which allows you to adjust the weights
used in calculating the Score.  So, if having complete player data (longer
name, Elo rating, etc.) is more important to you than annotations, you can
adjust this file to your taste.
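
To make the scoring idea concrete, here is a rough Python sketch of a
weighted score.  It is only an illustration: the weight names and values
below are made up and do NOT reflect the actual format or contents of
CBDEDUPE.CFG, and the Game fields are hypothetical stand-ins for whatever
CBDEDUPE reads from the database.

```python
from dataclasses import dataclass

# Hypothetical weights for illustration only; the real ones are in
# CBDEDUPE.CFG and are not reproduced here.
WEIGHTS = {
    "valid_year": 50,     # bonus when the game has a non-zero year
    "elo": 25,            # bonus per player who has a rating
    "comment_byte": 1,    # per byte of comments
    "variation_byte": 1,  # per byte of variations
}

@dataclass
class Game:
    number: int
    year: int = 0
    white_elo: int = 0
    black_elo: int = 0
    comment_len: int = 0
    variation_len: int = 0

def score(game, weights=WEIGHTS):
    """Weighted sum: the more information a game carries, the higher it scores."""
    total = 0
    if game.year > 0:
        total += weights["valid_year"]
    rated_players = sum(1 for elo in (game.white_elo, game.black_elo) if elo > 0)
    total += weights["elo"] * rated_players
    total += weights["comment_byte"] * game.comment_len
    total += weights["variation_byte"] * game.variation_len
    return total

bare = Game(number=1)
rich = Game(number=2, year=1987, white_elo=2600, comment_len=40)

# Of two duplicates, the lower-scoring game is the one to mark as deleted.
loser = min((bare, rich), key=score)
```

Raising one weight relative to the others is what shifts the preference,
for example toward games with complete player data over annotated games.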

"LIBERAL" is like "CONSERVATIVE" except that when given two games, one of
which has comments and/or variations while the other has none, CBDEDUPE
will delete the unannotated game.

"RADICAL" in which two games are duplicates if the moves are identical.
Like the "LIBERAL" approach, the game with the lower score is deleted.
The main point of the "RADICAL" approach is that if you have twenty copies
of the same game, but with different annotations, CBDEDUPE will find the
one with the most comments/variations and delete the rest.  Also, with the
"RADICAL" method, the games don't need to have the same year, so if you have
a game with year=1987 and others with year=2024 and year=0, CBDEDUPE will
delete all but year=1987 (assuming they have the same moves).
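
The three levels can be summarized as a single comparison rule.  The
Python sketch below is my reading of the descriptions above, not code
taken from CBDEDUPE: the Game tuple and the is_duplicate function are
hypothetical, and it assumes that LIBERAL, like CONSERVATIVE, still
requires the year to match.

```python
from collections import namedtuple

# Hypothetical in-memory view of a game; real ChessBase records differ.
Game = namedtuple("Game", "moves comments variations year")

def annotated(game):
    """True when the game carries any comments or variations."""
    return bool(game.comments or game.variations)

def is_duplicate(a, b, level):
    """Decide whether two games count as duplicates at the given level."""
    if a.moves != b.moves:
        return False              # every level requires identical moves
    if level == "RADICAL":
        return True               # moves alone decide; the year may differ
    if a.year != b.year:
        return False              # assumed for both CONSERVATIVE and LIBERAL
    same_notes = (a.comments, a.variations) == (b.comments, b.variations)
    if level == "CONSERVATIVE":
        return same_notes
    if level == "LIBERAL":
        # Also a duplicate when exactly one of the two is unannotated.
        return same_notes or annotated(a) != annotated(b)
    raise ValueError(f"unknown search level: {level}")

plain = Game("e4 e5 Nf3", "", "", 1987)
noted = Game("e4 e5 Nf3", "a fine move", "", 1987)
other_notes = Game("e4 e5 Nf3", "dubious", "", 1987)
no_year = Game("e4 e5 Nf3", "", "", 0)
```

With these sample games, plain and noted are duplicates under LIBERAL and
RADICAL but not under CONSERVATIVE, while noted and other_notes, or plain
and no_year, are duplicates only under RADICAL.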

Optionally, you can start CBDEDUPE with a "-v" parameter, and run in
"Verbose mode".  For example:

CBDEDUPE radical big.cbf -v

In verbose mode, CBDEDUPE writes a file called CBDEDUPE.OUT which shows
how the program decided which games to delete.  The output of the file looks
like this:

4366 (1040) vs 4615 (1595) Delete 4366
4615 (1595) vs 4372 (1624) Delete 4615
-------------------------------
2834 (94) vs 2758 (72) Delete 2758
-------------------------------
137 (383) vs 136 (449) Delete 137
136 (449) vs 138 (275) Delete 138
-------------------------------
1548 (53) vs 1597 (59) Delete 1548
-------------------------------
2139 (176) vs 2141 (252) Delete 2139
2141 (252) vs 2137 (452) Delete 2141
2137 (452) vs 2140 (302) Delete 2140
2137 (452) vs 2138 (326) Delete 2138
2137 (452) vs 2249 (410) Delete 2249
-------------------------------

Each line gives the game number and score (in parentheses) for each game
along with the number of the game which was marked to be deleted.  So, the
first line says that game 4366 (with score 1040) had the same moves as game
4615 (with score 1595) and that game 4366 (with the lower score) was deleted.

A row of dashes separates groups of games with identical moves.
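
If you want to post-process CBDEDUPE.OUT yourself, the format shown above
is easy to parse.  The following Python sketch assumes the file looks
exactly like the sample (one comparison per line, rows of dashes between
groups); the function name parse_out is my own invention.

```python
import re

# One comparison line: "<game> (<score>) vs <game> (<score>) Delete <game>"
LINE = re.compile(r"^(\d+) \((\d+)\) vs (\d+) \((\d+)\) Delete (\d+)$")

def parse_out(text):
    """Split verbose output into groups of (a, a_score, b, b_score, deleted)."""
    groups, current = [], []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if set(line) == {"-"}:        # a row of dashes closes the current group
            if current:
                groups.append(current)
                current = []
            continue
        match = LINE.match(line)
        if match:
            current.append(tuple(int(x) for x in match.groups()))
    if current:                        # tolerate a missing final row of dashes
        groups.append(current)
    return groups

sample = """\
4366 (1040) vs 4615 (1595) Delete 4366
4615 (1595) vs 4372 (1624) Delete 4615
-------------------------------
2834 (94) vs 2758 (72) Delete 2758
-------------------------------
"""

groups = parse_out(sample)
deleted = sorted({row[4] for group in groups for row in group})
```

Here groups holds one list per block of identical games, and deleted is
the sorted list of game numbers that were marked for deletion.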

As a benchmark, I ran CBDEDUPE 16-bit, CBDEDUPE 32-bit, the Nunn Utilities
and CBWIN against a test database of 76,866 games.  I got the following results:

PROGRAM                 DUPES FOUND      TIME TO RUN
=====================================================
CBWIN                      131              3' 56"
Nunn Utilities             526             24' 02"
CBDEDUPE 16-bit            324             14' 11"
CBDEDUPE 32-bit            324              1' 00"
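
To compare the run times more directly, they can be converted to seconds
and to games per second.  This small Python sketch only re-expresses the
figures from the table above; nothing in it is newly measured.

```python
# (minutes, seconds) copied from the benchmark table
times = {
    "CBWIN": (3, 56),
    "Nunn Utilities": (24, 2),
    "CBDEDUPE 16-bit": (14, 11),
    "CBDEDUPE 32-bit": (1, 0),
}
GAMES = 76_866  # size of the test database

def seconds(minutes_seconds):
    """Convert a (minutes, seconds) pair to total seconds."""
    minutes, secs = minutes_seconds
    return minutes * 60 + secs

baseline = seconds(times["CBDEDUPE 32-bit"])
for name, t in times.items():
    s = seconds(t)
    print(f"{name:<16} {s:>5} s  {GAMES / s:>6.0f} games/s  "
          f"{s / baseline:>5.1f}x the 32-bit time")
```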

One important point to note in all of this is that CBDEDUPE has to make
two decisions when it thinks it finds a duplicate:

1) Are the two games really duplicates?  The criterion for this varies
with mode (conservative, liberal or radical).

2) If they really are duplicates, which game should be deleted?  CBDEDUPE
always deletes the lower scoring game, based on the weights in CBDEDUPE.CFG.

=======================================================================

Now that you have a rough idea what CBDEDUPE does, let's go into a bit
more detail.

When writing a program to compare a large number of games to detect
duplicates, there are essentially two ways to go:  the conservative,
deterministic, memory and time intensive way, or the more liberal, fast,
probabilistic approach.  Each method has its advantages.

The Nunn Utilities, for example, seems to use the first approach, comparing
the moves, the game length, the players' names, the game result, etc.  If
the match is not exact, a duplicate is not detected.  The price for this
approach is a relatively slow, memory intensive program.  If you want to
de-dupe a 100,000-game database, you would be best to let it run
overnight!

I have chosen to take a complementary approach.  My system finds more
duplicates faster, but with the small chance that an occasional pair of
games that are not duplicates will be mistaken as such.

These "mistakes" occur for two reasons:

1) CBDEDUPE only looks at the moves of the game and the year, not the
result, or players' names.  So, if two games in the same year have the
exact same sequence of moves, they are considered to be the same game.
This happens rarely in chess, except with "grandmaster draws".

2) CBDEDUPE doesn't even compare each move.  Instead, it calculates a 32-bit
Cyclic Redundancy Check (CRC) for each game and compares that.  Now, a 32-bit
CRC has over 4 billion possible values, so the possibility that two different
games would just happen to have the same CRC is very small.  Based on my
measurements this leads to around one error every 20,000 duplicates.  Typical
databases have around 10% duplicates, so in a database of 500,000 games, you
might incorrectly delete around 3 games.
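
The CRC idea can be sketched in a few lines of Python.  This is only an
illustration under loud assumptions: it uses zlib's CRC-32, which may not
be the same CRC variant CBDEDUPE uses, and it hashes plain move strings
rather than ChessBase's binary move encoding.  The function names are
hypothetical.  The last lines repeat the error estimate from the text.

```python
import zlib
from collections import defaultdict

def fingerprint(moves, year=None):
    """32-bit CRC of the moves, optionally mixed with the year."""
    data = moves.encode("ascii")
    if year is not None:
        data += b"|" + str(year).encode("ascii")
    return zlib.crc32(data) & 0xFFFFFFFF

def candidate_groups(games, use_year=True):
    """Bucket game numbers by fingerprint; buckets of 2+ are duplicate candidates."""
    buckets = defaultdict(list)
    for number, (moves, year) in games.items():
        key = fingerprint(moves, year if use_year else None)
        buckets[key].append(number)
    return [numbers for numbers in buckets.values() if len(numbers) > 1]

games = {
    1: ("e4 e5 Nf3 Nc6", 1987),
    2: ("e4 e5 Nf3 Nc6", 1987),
    3: ("e4 e5 Nf3 Nc6", 0),
    4: ("d4 d5 c4 e6", 1987),
}

with_year = candidate_groups(games)                 # moves and year must match
moves_only = candidate_groups(games, use_year=False)

# Error estimate from the text: 10% duplicates, one error per 20,000 duplicates.
duplicates = 500_000 * 10 // 100
expected_false_deletions = duplicates / 20_000
```

Comparing one 32-bit number per game instead of every move is what makes
this approach fast; the price is that two different games can, very
rarely, land in the same bucket.  The estimate works out to 2.5 games,
which is where "around 3" comes from.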

So, these two "mistakes" are the price of the probabilistic design.  It is a
trade-off between accuracy and performance.  Overall, I think CBDEDUPE gives
vastly improved performance with a minimal loss of accuracy.

===========================================================================
