Patrick Cloutier, Cristian Tibirna, Bernard Grandjean and Jules Thibault
Département de génie chimique, Université Laval
Sainte-Foy (Québec) CANADA G1K 7P4
The program NNFit (Neural Network based data Fitting) allows the development of empirical non-linear correlations using an artificial neural network model: the multilayered perceptron. NNFit is a non-linear regression software package for finding relationships between a set of input variables Xi (1 ≤ i ≤ I) and a set of output variables Yk (1 ≤ k ≤ K), given a set of N relevant experimental data pairs [Xi, Yk]n (1 ≤ n ≤ N). As with any other empirical modeling approach, the user of NNFit must keep in mind that the quality of the regression models Y = f(X) obtained will depend on the relevance and the quality of the available experimental data. In addition, it is important to stress that success in fitting a model to a given set of data is absolutely no guarantee that the model will have generalization capability, i.e. that it will predict correctly a new set of data [Xi, Yk]m (N+1 ≤ m ≤ P).
The authors and Université Laval cannot be held responsible for the use of models developed with the NNFit program.
The reader is referred to the literature for an exhaustive presentation of neural network models (on the Internet: ftp://ftp.sas.com/pub/neural/FAQ.html).
Basically, a user who is not familiar with the neural network paradigm should simply consider the neural model as a non-linear regression model that gives a relationship between a normalized input vector U and a normalized output vector S. The transformation S = f(U) is represented by a multilayered neural network with a single hidden layer, as illustrated below:
Figure 1. Schematic representation of a neural network with layers
The model uses, besides the variables of the problem, a constant input named the bias, equal to 1, which is imposed on both the input and hidden layers.
The model equations are given below.
For the neurons of the input layer:
    Ui: vector of normalized input variables, UI+1 = 1 (bias)
For the neurons of the hidden layer:
    Hj = f( SUM(i=1..I+1) Wij Ui )
    Hj: output of neuron j of the hidden layer, HJ+1 = 1 (bias)
For the neurons of the output layer:
    Sk = f( SUM(j=1..J+1) Wjk Hj )
    f: sigmoid function, f(z) = 1/(1 + e^(-z))
    Sk: vector of normalized output variables
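The forward pass defined by these equations can be sketched as follows. This Python fragment is purely illustrative (NNFit itself is not a Python program); it computes S from U with the two bias terms appended explicitly:

```python
import math

def forward(u, w1, w2):
    """Forward pass of a single-hidden-layer perceptron with bias.
    u:  list of I normalized inputs (the bias U_{I+1} = 1 is appended here)
    w1: (I+1) x J weights between the input and hidden layers
    w2: (J+1) x K weights between the hidden and output layers
    Returns the K normalized outputs S_k."""
    f = lambda z: 1.0 / (1.0 + math.exp(-z))   # sigmoid
    u = list(u) + [1.0]                        # bias on the input layer
    n_hidden = len(w1[0])
    h = [f(sum(w1[i][j] * u[i] for i in range(len(u))))
         for j in range(n_hidden)]
    h.append(1.0)                              # bias on the hidden layer
    n_out = len(w2[0])
    return [f(sum(w2[j][k] * h[j] for j in range(len(h))))
            for k in range(n_out)]
```

With all weights equal to zero, every output equals f(0) = 0.5, which is a convenient sanity check of the implementation.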
The transformation of the actual variables (X, Y) into the normalized variables (U, S) is given by:
    Ui = (Xi - XMINi) / (XMAXi - XMINi)
    Sk = (Yk - YMINk) / (YMAXk - YMINk)
Remark: if a variable Xi (or Yk) covers many decades, the use of log10 Xi (or log10 Yk) may be preferred, and the normalization is then written as:
    Ui = (log10 Xi - log10 XMINi) / (log10 XMAXi - log10 XMINi)
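The normalization and its inverse can be written compactly; the sketch below (illustrative Python, not NNFit code) covers both the normal and the logarithmic reading modes:

```python
import math

def normalize(x, xmin, xmax, log_mode=False):
    """Map an actual variable X onto [0, 1] as U = (X - XMIN)/(XMAX - XMIN);
    in log mode the transform is applied to log10 of the values instead."""
    if log_mode:
        x, xmin, xmax = math.log10(x), math.log10(xmin), math.log10(xmax)
    return (x - xmin) / (xmax - xmin)

def denormalize(s, ymin, ymax, log_mode=False):
    """Inverse transform: recover the actual Y from the normalized S."""
    if log_mode:
        lo, hi = math.log10(ymin), math.log10(ymax)
        return 10.0 ** (lo + s * (hi - lo))
    return ymin + s * (ymax - ymin)
```

For example, a variable spanning 1 to 10000 normalizes to 0.5 at X = 100 in log mode, whereas in normal mode the same X would map very close to 0.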
Like all regression models, the neural model contains a number of fitting parameters:
- the value of J, the number of nodes in the hidden layer
- the values of the parameters Wij and Wjk, known as the weights
J being varied and chosen empirically, the model then contains [(I+1)·J + (J+1)·K] fitting parameters, which are determined by regression on a pertinent set of N pairs of experimental data.
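The parameter count above is easy to verify; a minimal helper (illustrative only) and two example values:

```python
def n_parameters(i_in, j_hidden, k_out):
    """Total number of weights in the network: (I+1)*J between the input
    layer (plus bias) and the hidden layer, and (J+1)*K between the
    hidden layer (plus bias) and the output layer."""
    return (i_in + 1) * j_hidden + (j_hidden + 1) * k_out
```

For instance, with I = 2 inputs and K = 2 outputs, J = 1 gives 7 weights and J = 6 gives 32 weights.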
The method consists in minimizing a quadratic criterion (the sum of the squares of the prediction errors: absolute errors, Qa, or relative errors, Qr) using as minimization algorithm a quasi-Newton method of the BFGS type [Press W.H. et al., "Numerical Recipes: The Art of Scientific Computing", Cambridge University Press, 1986]. This minimization step, which fits the model to a set of data, is known as "learning", and the data set used at this step is thus called the "learning file".
The quadratic criteria Qa and Qr to be minimized refer to absolute and relative errors respectively and are defined as follows:
    Qa = SUM(n) SUM(k) ( Sk(n) - Skexp(n) )^2
    Qr = SUM(n) SUM(k) ( (Sk(n) - Skexp(n)) / Skexp(n) )^2
Warning: if the relative criterion is used, none of the values Skexp(n) may be equal to 0. Since the normalization maps the Sk variables over the range [YMINk, YMAXk], the user must then choose a YMINk value slightly smaller than the minimum Yk value observed in the file, so that the normalization yields strictly positive Sk values.
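The two criteria can be sketched directly from their definitions (illustrative Python, with the same zero-value caveat for Qr as above):

```python
def quadratic_criteria(s_calc, s_exp):
    """Absolute (Qa) and relative (Qr) quadratic criteria over N data
    pairs, each given as a list of K normalized outputs.  Qr requires
    every experimental value to be non-zero."""
    qa = sum((c - e) ** 2
             for calc, exp_ in zip(s_calc, s_exp)
             for c, e in zip(calc, exp_))
    qr = sum(((c - e) / e) ** 2
             for calc, exp_ in zip(s_calc, s_exp)
             for c, e in zip(calc, exp_))
    return qa, qr
```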
Overfitting of neural models is a well-known problem. Because of their great plasticity, neural models may predict with great accuracy the set of data on which the fitting was done, while large prediction errors may be observed when the model is tested on a new data set.
To ease the understanding of overfitting, consider the variation Y = f(X) presented in the figure, for which only a limited number of experimental measurements is available, and suppose that a phenomenological model based on first principles exists, as presented in the figure. Overfitting corresponds to the situation where the neural model predicts the available experimental data accurately, yet is completely wrong when applied in the intervals between the learning data. Obviously, such a model is not reliable and its use is not recommended.
One approach for detecting overfitting consists in splitting the initial data file in two parts. The first part is the learning file, on which the minimization is performed. The other part (the generalization file) is used for testing the generalization capability of the model. Recall that the minimization algorithm works through an iterative process in which different values of the weights (Wij and Wjk) are explored in order to minimize the quadratic criterion. It is thus possible to check the predictions of the model on the generalization file at each iteration of the optimization algorithm.
The previous figure schematically represents the variation of the sum of squares of the errors on the learning file and on the generalization file as a function of the number of iterations. A rise of this sum on the generalization file is an indicator that overfitting is occurring. In order to avoid this problem, it is desirable to stop the minimization routine after only N1 iterations instead of letting the algorithm converge after N2 iterations.
This procedure of stopping the minimization before convergence is called the early stopping method and is available in NNFit.
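The early stopping idea can be sketched as a loop that keeps the weights giving the lowest generalization error. The two callables below are placeholders, not NNFit functions: `train_step` performs one optimization iteration and returns the current weights, and `eval_generalization` evaluates those weights on the generalization file.

```python
def early_stopping(train_step, eval_generalization, max_iter, patience=10):
    """Run the minimization iteration by iteration, remember the weights
    that gave the lowest error on the generalization file, and stop once
    that error has not improved for `patience` iterations (a sketch of
    the early stopping principle, not NNFit's actual stopping rule)."""
    best_err, best_w, best_it, stale = float("inf"), None, 0, 0
    for it in range(1, max_iter + 1):
        w = train_step()                   # one iteration of the optimizer
        err = eval_generalization(w)       # error on the generalization file
        if err < best_err:
            best_err, best_w, best_it, stale = err, w, it, 0
        else:
            stale += 1
            if stale >= patience:          # generalization error keeps rising
                break
    return best_w, best_it, best_err
```

In the notation of the figure, `best_it` plays the role of N1, the iteration at which the minimization should be stopped.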
The available N experimental data must be in the form of a numeric file of N lines and P columns; the columns must contain the values of X and Y. The order of the columns is free and some of the columns may be left unused in the modeling.
Warning: use the decimal dot, not the decimal comma.
N.B. In order to facilitate learning how to use NNFit, a data file, demo.dat (441 lines, 4 columns), is part of the distribution package (the data have been simulated following the equations y1 = 2x1 + 3x2 - 4x1x2 and y2 = -5x1 + 8x2 + 7x1x2); this file will be used below to illustrate the different functionalities.
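Since the demo data follow known equations, a file with the same structure can be regenerated for experimentation. The sketch below is an assumption about how the file could have been produced: 441 = 21 x 21 suggests a regular 21 x 21 grid, and the range [-10, 10] is assumed here for illustration.

```python
def make_demo_rows():
    """Simulate data following y1 = 2*x1 + 3*x2 - 4*x1*x2 and
    y2 = -5*x1 + 8*x2 + 7*x1*x2 on an assumed 21 x 21 grid over
    [-10, 10] (441 rows, 4 columns, like demo.dat)."""
    xs = [-10 + i for i in range(21)]
    rows = []
    for x1 in xs:
        for x2 in xs:
            y1 = 2 * x1 + 3 * x2 - 4 * x1 * x2
            y2 = -5 * x1 + 8 * x2 + 7 * x1 * x2
            rows.append((x1, x2, y1, y2))
    return rows
```

Writing the rows out space-separated, one per line, produces a file NNFit can read directly (remember: decimal dots, not commas).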
Run NNFit and the following menu bar is displayed:
New - for solving a new problem: it generates the configuration file (name.cfg) of the problem and allows running the optimization.
Open - for loading an existing configuration file and running the optimization again.
About - for displaying the names and addresses of the authors.
Click on New and choose the data file on which the modeling is to be run (example: demo.dat).
A working sheet appears, and all the information required to develop the model must be specified there.
The top frame displays the information on the data file (demo.dat, 441 lines and 4 columns).
The second frame, titled File partition for generalization, allows the user, if needed, to split the initial data file in two parts, in order to detect possible overfitting problems. To perform this partition, the number of data pairs retained for generalization must be specified, by choosing either the percentage or the number of lines to take from the initial file. Two partition methods are proposed:
- the random method (with this method the number of data lines retained in the newly created files will be close but not necessarily identical to the requested values)
- the continuous segment method (use the sliding cursor or the input boxes to indicate the size of the segment, its beginning and its end).
Once the required information is entered, click on the Partition button to create the two new files. These learning and generalization files are identified by their respective extensions, .axx and .gxx (xx may vary between 01 and 99); the root of their names is that of the initial file (example: demo.dat will be split into demo.a01 and demo.g01).
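The two partition methods can be sketched as follows (illustrative Python, not NNFit's actual code). Note how the random method only approximates the requested size, as stated above, while the continuous-segment method is exact:

```python
import random

def partition(lines, n_gen, method="random", start=0):
    """Split a data file (given as a list of lines) into a learning part
    and a generalization part.  With the random method each line goes to
    the generalization part with probability n_gen/len(lines), so its
    final size is close to, but not necessarily exactly, n_gen; the
    continuous-segment method takes exactly n_gen consecutive lines
    starting at index `start`."""
    if method == "random":
        p = n_gen / len(lines)
        learn, gen = [], []
        for line in lines:
            (gen if random.random() < p else learn).append(line)
        return learn, gen
    # continuous segment method
    return lines[:start] + lines[start + n_gen:], lines[start:start + n_gen]
```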
In the third frame, titled Network dimensions, the number of input variables (without the bias!), the number of output variables and the number J of hidden nodes (without the bias!) must be entered. The number of hidden nodes may vary between chosen minimum and maximum values; a family of (Jmax - Jmin + 1) models will then be created, and the best model can be selected later using the Simulations comparison facility offered in the Other menu. Click on Validation to record this information.
(Example: for the file demo.dat, the problem comprises 2 inputs and 2 outputs, and a variation of J between 1 and 6 is proposed.)
Warning: in a modeling problem, the variables to be predicted (in other words, the output variables) are known. On the other hand, the selection (or identification) of the input variables that are relevant to the problem is sometimes quite difficult. If the user needs to calculate the correlation coefficients between columns of the initial data file, the Correlation matrix facility in the Other menu may be used.
In the lower frame, Inputs and Outputs Assigning, the structure of the model must be associated with the structure of the data file. Select Inputs or Outputs, then choose the number of the input (or output) and associate it with the corresponding column (example for demo.dat: the problem considers 2 inputs and 2 outputs, all in normal reading mode: input 1 on column 1, input 2 on column 2, output 1 on column 3, output 2 on column 4). The minima and maxima of the data file columns are displayed automatically, but the user may change the normalization range of each variable if needed (warning: the user must in particular change the minima of the outputs if the convergence criterion based on relative errors is selected; see paragraph 2). The subframe Reading mode allows the use of the decimal logarithm of a variable's values instead of its actual values; this is particularly recommended when a variable's range covers many decades. For each chosen input (or output), the information must be recorded by clicking on the Validate button.
The Option button, which pops up the window displayed below, allows selecting various parameters needed by the minimization algorithm (the default values are those selected in the Preferences window of the Others menu):
- choice of the quadratic criterion to minimize: sum of squares of the relative or absolute errors
- choice of the maximum number of iterations of the optimization routine and of the convergence criterion
Remark: the minimization algorithm stops when either of these two criteria is satisfied.
- choice of the initial weights. The option Use if they exist allows, if a simulation has already been completed, initializing the weights to the values found in that last simulation. Otherwise, the initial weight values may be chosen randomly between the Maxi and Mini values set by the user, or fixed alternately to these minimum and maximum values (e.g. W11=Min, W12=Max, W13=Min, W14=Max, etc.).
Warning: the choice of the initial values of the weights may have a significant effect on the convergence of the optimization routine (with the possibility of converging to local minima).
Once all the information has been provided in the configuration sheet, click Save. This operation creates and saves a configuration file under a name chosen by the user (the .cfg extension is added automatically; e.g. demo.cfg); this name will afterwards be used to identify the results of the simulations. To start the calculation, click on Start. The user may then continue using NNFit or quit the program (click Quit; the launched simulations will continue to run in the background until completion).
This section of the program allows viewing the calculation results. Iteration displays the evolution of the quadratic criterion chosen for minimization as a function of the number of iterations. Prediction or Validation allows comparing the calculated values with the experimental ones, either on the learning and generalization files or on another file on which the user has validated an already built model.
After selecting Iteration, one must select the files with the *.ite extension when no partition was made on the initial data file, or *.ita and *.itg otherwise. In the latter situation, the simultaneous display of the two curves allows the user to detect possible overfitting and to choose the optimal number of iterations accordingly. Position the cursor located between the two graphs at the chosen number of iterations; a click on the New calculus button launches a second simulation identical to the displayed one (the initial weights are identical to those used the first time, as saved in the *.win files). The optimization routine will this time be stopped at the chosen number of iterations (in this way the user can apply the early stopping approach and thus avoid overfitting). The Total button displays the evolution of the total quadratic criterion Qa (or Qr) selected for minimization. The By output button displays the evolution of the quadratic criteria Qa,k (or Qr,k) associated with each output (the Previous and Next buttons may be used to display outputs k-1 and k+1). The Print button allows printing the graph on paper or to a file (for later printing).
After choosing the Prediction or Validation menu, one has to select the files with the respective extensions *xx.pre, *xx.pra or *xx.prg (xx indicates the number J of hidden nodes of the model, without the bias), or *.vyy.
The results display contains a graph and a table, which may be printed directly on paper or to a file (for later printing).
The table titled Information on prediction errors gives information about the prediction performance of the model:
- prediction errors: median, minimum, maximum and standard deviation. It is possible to use the absolute error, the relative error or the absolute value of the relative error:
- the correlation and determination coefficients defined below (values close to 1 are preferable!):
The frame Learning <--> Generalization allows the user to switch directly from the *.pra file display to the corresponding *.prg file display (or conversely).
The window called Predictions visualization allows comparing the calculated and experimental data. Two display options are available:
When the model has multiple outputs, use the Previous and Next buttons to display the results for outputs k-1 and k+1.
This menu allows using the models built in previous simulations. For each of the options listed here, the user is invited to choose the weights file, namexx.w, which characterizes the model to test.
- Validate on file allows testing the model on a new input/output data set for the problem.
By default, this new file is assumed to have a structure identical to that indicated in the configuration file (if available) associated with the weights file. If not, the user must select the inputs/outputs (as previously described in section 5.1). Click on the Start button to launch the calculation; the results will be stored in the file namexx.vyy (yy goes from 01 to 99, a new number being assigned automatically for each new test using the same weights file, namexx.w). The *.vyy files may be visualized using the Validation option of the View menu (see section 5.2).
- Simulate on file corresponds to the case where only the inputs are known and the outputs are to be predicted. The results are written to the file namexx.syy.
- Simulate on inputs allows using a model for a single simulation, in which data entry is performed on screen.
This menu allows identifying the simulations currently running and, if needed, killing them.
Remark: after an abnormal interruption of a simulation, its PID number will still be displayed in this dialog, even if the process no longer exists.
Correlation matrix allows calculating the correlation coefficients between the columns of a given data file. Between columns p and q, the correlation coefficient apq is defined as:
    apq = SUM(n) (Xp(n) - Xp_mean)(Xq(n) - Xq_mean) / sqrt( SUM(n) (Xp(n) - Xp_mean)^2 · SUM(n) (Xq(n) - Xq_mean)^2 )
where Xp_mean and Xq_mean are the means of columns p and q.
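This coefficient can be computed directly from its definition; a minimal sketch in Python (for illustration only, not NNFit's implementation):

```python
import math

def correlation(col_p, col_q):
    """Correlation coefficient a_pq between two columns of equal length,
    following the definition above."""
    n = len(col_p)
    mp = sum(col_p) / n
    mq = sum(col_q) / n
    num = sum((x - mp) * (y - mq) for x, y in zip(col_p, col_q))
    den = math.sqrt(sum((x - mp) ** 2 for x in col_p) *
                    sum((y - mq) ** 2 for y in col_q))
    return num / den

def correlation_matrix(columns):
    """Matrix of a_pq for every pair of columns."""
    return [[correlation(p, q) for q in columns] for p in columns]
```

Perfectly linearly related columns give +1 or -1, and each column correlates to 1 with itself, which is a quick check of the implementation.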
Simulations comparison allows quantifying the effect of the number of hidden nodes (J) on the fit of the models to the data files (learning or validation). The user selects the name of the configuration file (*.cfg) and the following table is displayed.
This window gathers, for all explored values of J (the number of nodes in the hidden layer), the mean errors, the standard deviations of the errors and the correlation coefficients for each output (K), both for the learning and the generalization files.
This table may ease the choice of the best model to retain.
This option allows selecting the default values of various modeling parameters.
6.1 namexx.ite (iterations file)
6.2 namexx.pra, .prg, .pre or .vyy (prediction files when experimental outputs are available)
6.3 namexx.w (weights file)
6.4 namexx.syy (prediction file when the outputs are not known in advance)
6.5 name.cfg (configuration file)
Data file: demo.dat
Lines number in the data file: 441
Columns number in the data file: 4
Random partition: YES
Lines in the generalization file: 125
Learning file: demo.a01
Generalization file: demo.g01
Maximum column no 1: 1.000000e+01
Minimum column no 1: -1.000000e+01
Maximum column no 2: 1.000000e+01
Minimum column no 2: -1.000000e+01
Maximum column no 3: 4.100000e+02
Minimum column no 3: -4.500000e+02
Maximum column no 4: 7.300000e+02
Minimum column no 4: -8.300000e+02
Relative criterium: no
Maximum number of iterations: 200
Use of the initial weights file? no
Maximum of weights: 1.000000e-01
Minimum of weights: -1.000000e-01
Distribution type for weights: Random
Input column no 1: 1
Input column no 2: 2
Output column no 1: 3
Output column no 2: 4
Norm_entree_max( 1): 1.000000e+01
Norm_entree_min( 1): -1.000000e+01
Reading mode( 1): Normal
Norm_entree_max( 2): 1.000000e+01
Norm_entree_min( 2): -1.000000e+01
Reading mode( 2): Normal
Norm_sortie_max( 1): 4.100000e+02
Norm_sortie_min( 1): -4.500000e+02
Reading mode( 1): Normal
Norm_sortie_max( 2): 7.300000e+02
Norm_sortie_min( 2): -8.300000e+02
Reading mode( 2): Normal
6.6 Hidden files
The program creates the following hidden files:
.nnFit, .nnfit_gnuplot, .nnfit_preferences, .nnfit_impression
Once a model is obtained, the user may employ it outside NNFit by using the generated weights file corresponding to this model, with a code similar to the Fortran example given below (the read statements sketch one possible layout of the namexx.w file and must be adapted to the actual file; ii, jj and kk are the numbers of inputs, hidden nodes and outputs).
*     The file namexx.w contains the parameters of the model
*     (adapt the read statements to the actual layout of namexx.w)
      do 10 i=1,ii
   10 read(1,*) xmin(i),xmax(i)
      do 20 k=1,kk
   20 read(1,*) ymin(k),ymax(k)
      do 30 i=1,ii+1
   30 read(1,*) (w1(i,j),j=1,jj)
      do 40 j=1,jj+1
   40 read(1,*) (w2(j,k),k=1,kk)
*     initialize the input vector x : x(i)= ? , for i=1 to ii
*     the network equations (normal reading mode; use log10 in log mode)
      u(ii+1)=1.
      do 100 i=1,ii
  100 u(i)=(x(i)-xmin(i))/(xmax(i)-xmin(i))
      h(jj+1)=1.
      do 110 j=1,jj
      z=0.
      do 120 i=1,ii+1
  120 z=z+w1(i,j)*u(i)
  110 h(j)=1./(1.+exp(-z))
      do 130 k=1,kk
      z=0.
      do 140 j=1,jj+1
  140 z=z+w2(j,k)*h(j)
      s(k)=1./(1.+exp(-z))
  130 y(k)=ymin(k)+s(k)*(ymax(k)-ymin(k))
*     the outputs are in the vector y
Compiled versions of the program are available for Unix, under the X environments of HP, IBM AIX, SGI and Sun, and also under Linux.
NNFit uses the X Window System and the GNUPLOT program, both freely available on the Internet.
The graphical interface of NNFit uses the free library XForms version 0.81, written by Dr. T. C. Zhao and Mark Overmars and improved thanks to a netwide effort. This library and its documentation are available on the WWW: http://bragg.phys.uwm.edu/xforms
The printed curves are plotted with GNUPLOT, a freely distributable plotting program.
The development of NNFit was carried out on several Intel x86 (IBM-compatible) PCs running Linux, a free, fully featured Unix-like operating system.
Professors Grandjean and Thibault thank the Natural Sciences and Engineering Research Council of Canada (NSERC/CRSNG) for its financial support. Last modified: May 12, 1997, 23:05 EST