Calculating the XMODEM CRC
                         J. LeVan, Ph.D.

Introduction:
     I have been in the process of developing a terminal emulator
program for my home computer.  I was very unhappy with the Xmodem
transfer  rate,  especially at high modem speeds.   The  checksum
method  is clearly faster,  but cannot catch communication errors
as  effectively.   The CRC method has an upper end throughput  of
500 bytes per second.

     I place a help call on CompuServe and receive an  intriquing
response  from Bela Lubkin.   The algorithm he suggested appeared
to be much quicker than the classical bit shift method.   It  was
not at all clear to me how the algorithm worked.   The purpose of
this  note  is  an  explanation of how and  why  a  table  lookup
algorithm can work.

     Throughout  this note,  I will use "effective throughput" to
mean:
  (SizeOfPacket * NumberOfPackets)/(TimeOfEot - TimeOfInitialSOH)
All source code will be in C.  These programs were written to run
on an Apple MacIntosh, which is equipped with a high speed serial
port.
     Below  is  the  standard  bit  shift  and  occasionally  XOR
algorithm.   The condition for doing the XOR is simply, if a 1 is
going  to be shifted out of the sign bit then shift and  XOR  the
old  CRC  with  the  generator.   Note that  the  inner  loop  is
performed exactly eight times.
________________________________________________________________
___Fig.1: Bit Shift CRC_________________________________________

unsigned short ComputeCrc(crc,bufptr,len)
register unsigned short crc;   /* unsigned shorts are 16 bits */
register char *bufptr;
register short len;
{
     register int   i;
     while (len--) {           /* calculate CRC from end to start*/
          crc ^= (unsigned short) (*bufptr++)<<8;
          for ( i=0 ; i<8 ; ++i ) {
               if (crc & 0x8000)
                    crc = (crc << 1) ^ 0x1021;
               else
                    crc <<= 1;
          }
     }
     return(crc);
}
________________________________________________________________

     After  exactly eight shifts,  there will be no carries  from
the low byte out of the top of the high byte.  This leads us to a
critical observation:

Ob  1:  Whether  or  not an XOR is done with the CRC  does  *not*
depend on the low byte of the CRC.

     Let A,  B,  and C be sixteen bit unsigned quantities.  We'll
use the standard C notation for XOR,  a carat (^).  The following
mathematical properties hold.

Ob 2:  XOR is Associative and Commutative.  I.e: For any choice
of A, B and C; these equations will always hold true:
     A^B = B^A                                    (1)
     (A^B)^C = A^(B^C)                            (2)

     Likewize,  using  the standard C notation for a  left  shift
( A<<B means A shifted left B times ),  it can be shown that left
shift is distributive over XOR.

Ob 3:  Left shift distributes over XOR.  I.e: For any choice of A
and B; this equation will always hold true:
     (A^B)<<1 = (A<<1) ^ (B<<1)                   (3)

     Now  take  a close look at the inner loop of the  ComputeCrc
algorithm.   First  decompose the variable 'crc' into two  parts,
'crch' and 'crcl'.   The variable 'crch' is simply the high  byte
of  'crc'  with the low byte changed to all  zeroes.   Similarly,
'crcl' is the low byte of 'crc',  with zeroes in the high half of
that word.  To create 'crc' from 'crch' and 'crcl' we can use:

     crc = crch ^ crcl

The first pass through the loop yields:

     crc = ((crch ^ crcl)<<1) ^ P1

wherein P1 is either 0 or 0x1021. After two iterations we have:

     crc = {[ ((crch^crcl)<<1) ^ P1] << 1} ^ P2
or:
     crc = (crch << 2) ^ (crcl << 2) ^ (P2 << 1) ^ P2

wherein P2 is either 0 or 0x1021.  Looking ahead we can write the
equation for crc after eight iterations:

     crc = (crch << 8) ^ (crcl << 8) ^ ((P1<<7) ^ (P2<<6) ^..^ P8)

     The term (crch<<8) vanishes, and (crcl<<8) is easy enough to
compute  so the only question remaining is how we  can  calculate
the last term.  Observation one tells us that the third term only
depends  on  the initial value of crch.   This means that we  can
compute  the third term by computing all possible values  in  the
inner  loop  for  crc = 0x0000 to 0xFF00 stepping  by  0x100  and
storing  the  values  in an array  CrcTable,  indexed  by  values
running  from 0x00 to 0xFF.   This is exactly the function of the
algorithm presented below:


________________________________________________________________
___Fig.2: Table lookup CRC______________________________________


unsigned short CrcTable[256];   /* declaration for the table */

InitCrcTable()
/* this routine calls ComputeCrc, defined in Fig.1 */
{
     int count;
     char zero;
     zero=0;
     for ( count=0 ; count<256 ; count++ )
          CrcTable[count] = ComputeCrc(count<<8,&zero,1);
}

unsigned short TableCrc(crc,bufptr,len)
register unsigned short crc;
register unsigned char *bufptr;
register short len;
{
     while (len--)
          crc = CrcTable[crc >> 8^(*bufptr++)] ^ crc<<8;
     return(crc);
}

________________________________________________________________

Thus our third term is found by:

     ((P1<<7) ^ (P2<<6) ^ ... ^ P8) = CrcTable[crch>>8]
the final crc is caluclated by:
     crc = crc_table[crch>>8] ^ (crcl<<8)

Substituting  this  relation into the algorithm  in  Fig.  1,  we
easily obtain the algorithm in Fig. 2.

     That  much  having been said,  it only remains to  be  seen,
whether  or not it has all been worthwhile.   How much faster  is
table lookup?
     After building a 128 byte sector filled with random numbers,
I  ran  a  benchmark  in which the CRC  and  Checksum  were  each
calculated 100 times with the following results.   Here each tick
represents 1/60th of a second.

________________________________________________________________
___Table.1: Comparison of methods for 100 sectors_______________

Slow CRC            120 ticks
Table CRC            25 ticks
Checksum              3 ticks
________________________________________________________________

     The table driven method is an order of magnitude faster than
the older CRC method,  and only an order of magnitude slower than
the Checksum method.   This offers a much smoother trade off than
originally  available  with  the  CRC  check.    What  do   these
differences mean in terms of effective through put?
     My   communications   program   also  uses   the   following
optimizations which can affect transfer speed:
     All disk I/O is double buffered.
     The  transmitter sends the body of the Xmodem  Packet  while
simultaneously calculating the CRC.  This leads to an almost zero
calculate time at low speeds.
     The machine on the receiving end of the line only calculates
the CRC after a complete packet has arrived.
________________________________________________________________
___Table.2: Transfer measurements_______________________________

               Effective Throughput: (bytes/sec)
     Baud:     1200      2400      9600      19200     57600
Method:        (128 byte packets)
Slow CRC       106.......201.......598.......770.......773
Table CRC      109.......210.......692.......1113......1172
Checksum       110.......213.......709.......1155......1263

Method:        (1k byte packets)
Slow CRC       ..........210.......628.......939.......994
Table CRC      ..........220.......729.......1185......1185
Checksum       ..........222.......745.......1226......1973
________________________________________________________________

     One  should take benchmarks with a grain of salt.   However,
it  is clear that the table lookup algorithm stays closer to  the
checksum  method  than the standard  bit  shift  algorithm.   The
efficiency  and  speed  of compiled code can  vary  greatly  from
machine  to  machine.   These tests were made with two  computers
linked directly through their serial ports.
     I have found that CompuServe typically yields and  effective
throughput  of  60-80  bytes  per second  at  1200  baud.   Genie
typically yields 70-80 bytes per second,  and my neighborhood Vax
785 lets me run with an effective throughput of 103-105 bytes per
second, all with Xmodem.
     In  my first attempt at Xmodem,  I used a function call  for
processing each character in the crc loop.   In addition,  I  did
not  overlap the CRC calculation with transmission of the packet.
This  was  a  dreadful mistake.   I could not  get  an  effective
throughput  of more than 500 characters per second even  with  4k
buffers  at any speed.   The bottom line is that it take a number
of optimizations to speed up a complex process.