                    
                              UPDATE (part 1)
                              ~~~~~~~~~~~~~~~
         -  Changes since publication of the manual for PEPI Version
            2.0 (these are described in more detail in part 2; see 
            UPDATE2.TXT).

         -  Documentation for LOGISTIK and LOGX.
           
         -------------------------------------------------------------
           
                 MAIN CHANGES SINCE PUBLICATION OF THE MANUAL
                 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
         PEPI Versions 2.07 and 2.07a
         ~~~~~~~~~~~~~~~~~~~~~~~~~~~~
         -  New: LOGX (a second multiple logistic regression program,
            slower than LOGISTIK but capable of handling more extensive
            data).

         -  In Version 2.07a an error was corrected in the treatment of 
            missing values in conditional logistic regression (LOGISTIK 
            and LOGX).  This is the only difference between Versions 2.07 
            and 2.07a. 
                  
         -  Enhanced user-friendliness: each program offers more
            information, including a brief description of what it does,
            and it is now easier to leave a program during the
            data-entry phase.
            
         -  PAIRS offers an option for the comparison of paired values,
            assuming a lognormal distribution.
            
         -  MATCHED computes the conditional power of Walter's test
            for binary data (comparing cases with multiple matched
            controls).
            
         -  POOLING (which combines probabilities) has been
            incorporated in COMBINE (which combines measures of
            association).
            
         PEPI Version 2.06
         ~~~~~~~~~~~~~~~~~
         -  New: LOGISTIK program (multiple logistic regression 
            analysis). 

         -  New: Program Finder (to identify programs to meet specific 
            needs) for DOS (FINDER.COM, supplemented by PROGRAMS.COM, 
            which provides thumbnail descriptions of the programs). 
           
         -  SCRN provides additional results for a test that yields a 
            range of values: the area under the ROC curve, and the 
            optimal cutting-points under different conditions. 
   
         -  Improvements in computation by CONFINT of exact confidence 
            intervals for ratios of person-time rates with very large 
            numerators and for very large proportions in large samples. 
                                 
         -  RATES2 provides exact confidence intervals (in single 
            strata) for larger numbers than previously. 
                                          
         -  KAPPA provides an alternative overall kappa (based on the 
            common correlation model) for analyses of multiple strata, 
            with a heterogeneity test and confidence intervals that are 
            appropriate even for fairly small samples. 
            
         -  For convenience, RANDOM now uses a compressed format (six 
            results per line) for saving or printing simple random 
            samples and sequences. 
         
         PEPI Versions 2.01 to 2.05
         ~~~~~~~~~~~~~~~~~~~~~~~~~~            
         -  New: Program Finder (to identify programs to meet specific 
            needs) for Windows. 
         
         -  The DerSimonian-Laird procedure has been added.  This uses 
            a random-effects model when computing an overall measure of 
            association for a set of studies or strata, and may be 
            helpful in meta-analyses based on studies with 
            heterogeneous results.  It is provided as an option in 
            CASECONT (for odds ratios), RATES1 and RATES2 (for risk and 
            rate ratios), and COMBINE (for other measures). 
          
         -  EXACT2XK is now faster, and CASECONT and RATES2 are much 
            faster when computing exact probabilities and exact 
            confidence intervals for odds ratios and person-time rate 
            ratios for stratified data. 
            
         -  CASECONT can now compute the exact power of Fisher and mid-
            P exact tests. 
          
         -  MATCHED and PAIRS now provide exact probabilities and 
            confidence intervals for odds ratios, and CONFINT provides 
            approximate mid-P confidence intervals for proportions and 
            rates when it does not compute exact intervals. 
         
         -  COMBINE can now compute and combine effect sizes
            (standardized differences between means), and can accept 
            confidence intervals for measures to be combined, instead 
            of standard errors. 
         
         -  SCRN now displays ROC curves (with 95% confidence bounds) 
            showing the association between the sensitivity and false 
            positive rate of a test that provides a range of values. 

         -  KAPPA now provides adjusted kappa values (for two ratings 
            using a dichotomous scale) that compensate for the effects 
            of bias and unequal prevalences of the categories. 
         
         -  DIFFER can now estimate the effect of regression to the
            mean.
         
         -  WHATS can now compute permutations and combinations.  The
            factorial function has been extended, and now accepts
            numbers up to 1,754.

         -  Hewitt's rank-sum test for seasonality was added to 
            SEASONAL. 

         -  The accuracy of the function for estimating F from a P 
            value has been enhanced.  The modified function is used in 
            CONFINT, PVALUE and PAIRS. 
         
         -  Errors were corrected in MANNWHIT (in the handling of zero 
            and negative values entered in the Wilcoxon test) and 
            MATCHED (in Walter's test for a fixed number of controls).

         --------------------------------------------------------------

         LOGISTIK and LOGX   Logistic regression (unconditional and
         ~~~~~~~~~~~~~~~~~   conditional)
         
           [This description applies to both LOGISTIK and LOGX.  LOGX
           uses extended memory and can handle much more data, but is
           far slower and should be used only if LOGISTIK cannot cope.]
         
         This program performs multiple logistic regression analysis 
         and provides aids to the use of the results of such an 
         analysis.  
         
         It may be used in cohort studies and trials (where the 
         dependent variable is the occurrence or nonoccurrence of a 
         disease or other outcome), in case-control studies of risk or
         protective factors associated with a disease (where the
         dependent variable is membership of the case or control
         group), in studies that aim to determine how diagnostic or
         prognostic criteria can be combined to appraise the
         probability that a disease is present or likely to occur, and
         for other purposes.  For discussions of the use of multiple
         logistic regression in epidemiological studies, see (inter
         alia) Kahn and Sempos 1989: 148-165, Selvin 1996: 243-269,
         298-310, and Schlesselman 1982: 227-275.
         
         The procedure measures the effects of single variables or
         combinations of variables on the log of the odds of (e.g.)
         occurrence of a disease.  The analysis permits control of
         confounding effects and appraisal of modifying effects;
         first-degree interactions can be included in the model.  
         
         The program calculates the regression coefficients (constant,
         a, b, etc.) in a regression equation of the format:
         
                  log-odds(Y=1) = constant + aA + bB + cC ...
                  
         where Y is a dichotomous dependent variable, Y=1 refers to
         one of its categories (e.g., the presence of a disease), and
         A, B, etc. are independent (explanatory) variables; the model
         may include interaction terms involving two variables, such
         as A*A (a quadratic term) or A*B.  Each coefficient (a, b
         etc.) reflects the influence of the relevant variable or
         interaction when other influences are held constant; it
         expresses the change in the log-odds for a unit change in the
         specific variable when all the other covariates in the
         regression model are held constant.
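         By way of illustration (this is not PEPI code), the fitted
         equation can be applied to a subject's values in a few lines
         of Python; the constant and coefficients below are purely
         hypothetical:

```python
import math

def log_odds(constant, coefs, values):
    """Linear predictor: constant + a*A + b*B + ..."""
    return constant + sum(c * v for c, v in zip(coefs, values))

def predicted_probability(lo):
    """Convert log-odds to the probability that Y = 1."""
    return 1.0 / (1.0 + math.exp(-lo))

# Hypothetical model with two variables, A and B
p = predicted_probability(log_odds(-2.0, [0.5, 1.2], [1.0, 0.0]))
# A = 1, B = 0 gives log-odds -1.5, i.e. a probability of about 0.18
```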
                                
         Special features of the program are:
         
         - It does both unconditional analyses (for unmatched data) and 
           conditional analyses (which are appropriate for matched or 
           finely stratified data; Selvin 1996: 298-310). 
                        
         - It will accept grouped as well as individual data.  If the 
           data have been summarized in a contingency table the 
           individuals in each cell of the table can be entered as a 
           group. 
           
         - It can treat independent variables either as "simple"
           (continuous) or as "factors" with up to 10 categories (each
           containing one or more values) defined by the user.  The
           categories can be regarded as nominal, each category
           (except the first) being contrasted either with the first
           one ("baseline") or with the preceding one; the program
           creates dummy variables for this purpose. The contrasts
           with the preceding category (Walter et al. 1987) permit
           close scrutiny of the effects of successive levels, e.g. in
           trials using different doses.  As a third alternative,
           a factor can be treated as a simple (ordinal) variable, its
           successive categories constituting its levels (1, 2, 3
           etc).
           
         - The program can "center" variables by subtracting the mean 
           from each observation; this reduces the effect of 
           collinearity (highly correlated independent variables), and 
           is especially useful if both a variable (e.g. age) and its 
           quadratic term (age-squared, or age*age) are included in the 
           model (Selvin 1996: 256-259; Breslow and Day 1980: 233-236). 
           
         - Missing values can be defined for specific variables, so 
           that individuals with missing values will be omitted from 
           analyses involving these variables; the whole analysis can 
           be restricted to a selected subgroup of the sample. 

         - The models to be fitted can be specified in three ways: by 
           entering a list of variables and (optionally) interactions; 
           by entering variables to be added to the previous model (a 
           likelihood-ratio test assesses the effect of the addition); 
           or by an automatic mode that creates a series of models, 
           each comprising a different selected "main effect" and a 
           uniform set of selected "control variables" (a likelihood-
           ratio test is done for each "main effect"). 
           
         - The model can include first-degree interactions; results are 
           displayed for each level of an interaction involving dummy 
           variables. 
           
         - The number of iterations and the convergence criterion and 
           tolerance criterion can be modified. 
           
         - The program provides the Hosmer-Lemeshow goodness-of-fit 
           test and several other indicators of the aptness of a
           model.
           
         - The program provides utilities to assist in use of the 
           results (or of logistic regression analysis results entered 
           at the keyboard).  It computes the predicted probability of 
           the outcome variable for given values of the variables
           (for use in cohort studies), and an odds ratio that
           expresses the contrast between two alternative sets of
           values of selected variables; approximate confidence
           intervals are estimated. It also computes the odds ratio
           for a specified difference in the magnitude of a value, and
           the odds ratio corresponding to a given coefficient; it
           tests the significance of variable(s) by comparing the
           likelihood statistics for two models.
           
         - The program offers a correlation matrix (Pearson's 
           correlation coefficients) for the terms in the regression
           model, as an aid to the exploration of collinearity. 
           Factors are treated as ordinal variables for this purpose.
           
         - The program has a word-processing facility, permitting the
           addition of comments to the saved or printed results.
           
         - Copious on-screen help is provided.
         
         - The program can read data either from an ASCII (text) file
           prepared in a suitable format (see below) with any word
           processor, text editor or data entry program, or from a
           .REC file created by the Epi Info program for epidemiology
           on microcomputers (Dean 1994).  The program can create a
           dictionary file to record information about the variables.
         
           
         The program computes regression coefficients (with their
         standard errors) and the corresponding odds ratios (with 90,
         95 or 99% confidence intervals).  For a simple variable, the
         coefficients and odds ratios express the effect of a change
         in magnitude of one unit.  For a factor, they express the
         contrast with the reference category (baseline or preceding)
         or (if the factor is treated as ordinal) the effect of a
         one-level change.  The program also displays z-scores (the
         coefficients divided by their standard deviations) for use in
         comparing the impact of different variables (Selvin 1996:
         272), crude odds ratios, a variance-covariance matrix, the
         log-likelihood for the model and for the null model, and
         the G statistic for the model (-2 times the log-likelihood).
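         The conversion from a coefficient to an odds ratio and its
         confidence interval can be sketched as follows (illustrative
         Python, not PEPI code); 1.96 is the standard normal deviate
         for 95% confidence:

```python
import math

def odds_ratio_ci(b, se, z=1.96):
    """Odds ratio and confidence limits from a coefficient and its SE."""
    return math.exp(b), math.exp(b - z * se), math.exp(b + z * se)

# A hypothetical coefficient of ln(2) with standard error 0.30
or_est, lower, upper = odds_ratio_ci(0.6931, 0.30)  # OR close to 2.0
```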
         
         Three significance tests are provided:  The Wald test, which 
         uses the square of the z-score as chi-square, and may be 
         over-conservative for a large coefficient (Hauck and Donner 
         1977); the likelihood-ratio test, which is based on a 
         comparison of G statistics for models that do and do not 
         contain the variable(s) in question (i.e., nested models); 
         and the score test (Breslow and Day 1980: 207-208), which is 
         recommended for small samples.  For a factor that is treated 
         as ordinal, these are tests of trend. 
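         As an illustration of the likelihood-ratio test, the drop in
         the G statistic between two nested models is referred to the
         chi-square distribution.  The sketch below (hypothetical G
         values, one added variable, so df = 1) uses the identity
         P(chi-sq on 1 df > x) = erfc(sqrt(x/2)):

```python
import math

def lr_test_df1(g_reduced, g_full):
    """Likelihood-ratio test for one added variable (df = 1).
    G = -2 * log-likelihood; the drop in G is chi-square on 1 df."""
    chi_sq = g_reduced - g_full
    p = math.erfc(math.sqrt(chi_sq / 2.0))  # upper tail, chi-square df=1
    return chi_sq, p

# Hypothetical G statistics for models without and with the variable
chi_sq, p = lr_test_df1(210.4, 204.9)  # chi-sq = 5.5, P about 0.02
```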
         
         Goodness of fit of the model is appraised by the Hosmer-
         Lemeshow test: individuals are arranged in a rising sequence
         of probabilities (as estimated by the model) and split into
         equally-sized "deciles of risk", in which observed and
         expected numbers (for both "yes" and "no") are compared.  A low
         P value indicates that the fit is poor.  A good fit does not
         necessarily mean that the results of the logistic analysis
         are valid, but a poor fit points to low validity.  The test
         result may be biased if there are tied probabilities (e.g. if
         grouped data were entered), since tied individuals may be
         spread over adjacent deciles, and their original arrangement
         with respect to observed status may persist.  To reduce this
         possible bias the individuals are randomly shuffled ten times
         before the test. If there are ties, reshuffling leads to a
         (randomly) different result; the program therefore permits
         repeated tests, and displays the median P value.
         
         The program provides several other indicators of the
         suitability of the model: the Pearson correlation coefficient
         between the observed value of the dependent variable (0 or 1
         = "no" or "yes") and the probability (of "yes") predicted by
         the logistic equation; its square, which is an estimate of
         the proportion of explained variation (Mittlboeck and
         Schemper 1996); and Darlington's "logistic regression fit
         index" (Darlington 1990: 449) and the pseudo R-squared value
         (Selvin 1996: 266), both of which are based on a comparison
         of likelihood statistics based on the full model and on the
         null model, and are not direct measures of goodness of fit.

         Requirements for the data file are described in a Help file
         displayed by the program.  A .REC data file can be prepared
         by entering data in the Epi Info program, or (using the
         IMPORT utility provided by Epi Info) by converting a data
         file from another data entry program.  If an Epi Info .REC
         data file is not used, the file must be in ASCII (text)
         format, and must contain numbers only; each individual or
         (for grouped data) each group must have a separate record, in
         which the values of the variables are separated by spaces or
         commas.  The dependent variable may have only two values: 1
         (for presence of the outcome, or cases) and 0 (for absence of
         the outcome, or controls), but the program can convert other
         values to 1 and 0 before the analysis.  Unless an Epi Info
         .REC file is used, the names and sequence of the variables
         must be supplied by entry at the keyboard or by loading a
         dictionary file.
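         For example, a small ASCII data file in this format could be
         read as follows (illustrative Python; the three variables --
         outcome, age, and blood pressure -- are invented):

```python
import io

def read_records(f):
    """One record per line; values separated by spaces or commas."""
    records = []
    for line in f:
        line = line.strip()
        if line:
            records.append([float(v) for v in line.replace(",", " ").split()])
    return records

sample = io.StringIO("1 45 132\n0,52,118\n")
data = read_records(sample)  # [[1.0, 45.0, 132.0], [0.0, 52.0, 118.0]]
```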
         
         When LOGISTIK uses an Epi Info .REC file, it translates "Y" 
         and "N" to "1" and "0" respectively, "F" and "M" to "1" and 
         "0", other entries in single-letter fields to 9's, and blank 
         fields to 9's.  Entries in date formats and records that are 
         marked as "deleted" are not used. 
         
         LOGISTIK can prepare a dictionary file for subsequent use.
         This records the names and sequence of the variables and
         specifies the dependent variable, the frequency variable
         (for grouped data) and the matching variable (for matched data),
         whether variables are factors, and (if so) the number of
         categories, the cut-points, and the way the factor is to be
         treated in the analysis.
         
                                           
         Formulae
         ~~~~~~~~
         Basic statistical techniques for multiple regression analysis 
         are described by Breslow and Day (1980: 182-227). The 
         computation is based on MULTLR (Campos-Filho and Franco 1989), 
         which uses adapted algorithms from LOGRESS (McGee 1986) and 
         PECAN (Lubin 1981), respectively, for unconditional and 
         conditional maximum likelihood estimation of the logistic 
         coefficients. 
         
         Odds ratios are the exponentials (antilogs) of the 
         corresponding coefficients or log odds, and the predicted 
         probability of the outcome is 1 / [1 + exp(-log odds)].  The 
         odds ratio expressing the contrast between two alternative 
         sets of values is the exponential of the difference between 
         the log odds for the two sets.  Confidence limits for an odds 
         ratio or predicted probability are derived from the confidence 
         limits of the corresponding coefficient or log odds. 
                                  
         The confidence limits of a coefficient B are B + Z.SE and 
         B - Z.SE, where Z is the standard normal deviate
         corresponding to a specific confidence level and SE is the
         standard error of the coefficient.  The confidence limits of
         bB, where the weight b is the value of the variable (for a
         dummy variable the weight is 1 or 0, and for an interaction
         term it is the product of the values of the interacting
         variables), are b(B + Z.SE) and b(B - Z.SE).
         
         Confidence intervals for predicted probability are estimated 
         in the same way, using the log odds calculated from the 
         regression equation and the variance of the sum of the terms 
         (weighted coefficients, including the constant and 
         coefficients for interactions) in the equation.  The formula 
         for the variance (Hosmer and Lemeshow 1989: 103-106) is 

                           SUM(bi^2.Vi) + SUM(2.ci.Ci), 
                         
         where 
                            
            Vi = variance of coefficient 
            bi = weight (as defined above) 
            Ci = covariance (for each pair of coefficients) 
            ci = the product of the weights given to the terms in the
                 pair 
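         The variance formula amounts to a weighted sum over the
         variance-covariance matrix, and can be checked with a short
         (non-PEPI) sketch; the weights and matrix below are invented:

```python
def variance_of_sum(weights, cov):
    """Variance of a sum of weighted coefficients:
    SUM(bi^2 * Vi) + SUM over pairs of (2 * bi * bj * Cij)."""
    n = len(weights)
    var = sum(weights[i] ** 2 * cov[i][i] for i in range(n))
    var += sum(2.0 * weights[i] * weights[j] * cov[i][j]
               for i in range(n) for j in range(i + 1, n))
    return var

# Two terms with weights 1 and 2: 0.04 + 4*0.09 + 2*2*0.01 = 0.44
v = variance_of_sum([1.0, 2.0], [[0.04, 0.01], [0.01, 0.09]])
```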
                 
         Confidence intervals for the odds ratio contrasting two 
         alternative sets of values are estimated in the same way. They 
         are based on the difference between the log odds for the two 
         sets of relevant variables (including their interaction terms, 
         if any).  One expression is subtracted from the other,
         unspecified terms and those that are held constant being
         cancelled out by the subtraction.  The variance of the sum of
         the remaining terms is calculated by the above formula.
                         
         Pseudo-R-squared is defined as
         
                                (LLN-LLM)/LLN 
                    
         and Darlington's "logistic regression fit index 1" as
         
                    {exp[(LLN-LLM)/N] - 1}/[exp(-LLN/N) - 1],
                    
         where 
         
            LLM = the log-likelihood for the model 
            LLN = the log-likelihood for the null model
            N = sample size

         The formula for the goodness-of-fit test (Hosmer and Lemeshow 
         1989: 140-145; Selvin 1996: 264-266) is
              
                         chi-sq = SUM[(O-E)^2 / E]
                         
         where O = observed number and E = expected number, counted 
         separately for 'Yes' and 'No' categories in each decile (20
         cells altogether).
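         The summation over the 20 cells can be sketched as follows
         (illustrative Python; the counts shown are invented):

```python
def hl_chi_square(observed, expected):
    """chi-sq = SUM[(O - E)^2 / E], summed over all 20 cells
    ('Yes' and 'No' counts in each of the ten deciles)."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Two of the twenty cells, for brevity
x = hl_chi_square([10, 12], [11.0, 11.0])
```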

         Note
         ~~~~
         An error in the conditional logistic regression procedure (in
         Versions 2.06 and 2.07) was corrected in Version 2.07a.
         
         Acknowledgments 
         ~~~~~~~~~~~~~~~ 
         We are grateful to Dan McGee for providing us with the Fortran 
         code for LOGRESS, Pat Anderson for permission to use his EDITWIN 
         word-processor unit, and Dr A. Negassa for his helpful comments.  
         An adaptation of TXT2UNIT (version 2.0) by J.J. Arenzon was used 
         to prepare the Help file about data bases, and EXECSWAP 
         (Kokkonen 1989) was used in the shell-to-DOS option. 
         
         References   
         ~~~~~~~~~~
         See UPDATE (part 3)


