Jeffrey: Ok, so we’ve established that outliers can be our friends, but they cause many of the statistical metrics to behave oddly. Thus, we temporarily suppress them to develop an accurate profile of the data and then decide whether we will keep them (good data) or delete/fix them (bad data). But I want to move on to some details about regression analysis. There are lots of misunderstandings around regression analysis that often result in mistakes or misuse!
Let’s start with some generalities that folks should keep in mind:
First, regression is a classical statistical technique that assumes the observations (data) are independent (no relationship to one another). Earth modeling data is generally dependent (it matters whether you drill a well north, south, east, or west of a producing well), so regression is not, and should not be confused with, a spatial estimation technique! For example, you would never want to take a map of, say, porosity and transform it to permeability via a regression equation! I know, that sounds like heresy, but it’s true! I’ll let my esteemed colleague explain…
Rich: Jeffrey, you pointed out one of the big differences between classical statistical assumptions and geostatistics (spatial statistics), and that is whether there is a spatial dependency between measurements. Most of the data types we deal with in the petroleum industry have a coordinate associated with the measured values. Recall that a simple linear regression model has only one weight (the slope term), which is used to “transform or scale” one measurement type into the other. The use of a single weight in a “mapping” exercise makes the assumption of data stationarity, or no trend in the data. Because the stationarity assumption is often not valid for our data sets, the estimated values will also show a trend in the estimation error. Another attribute of the regression model is the tendency to overestimate low values and underestimate high values. We should also note that unless some error term is introduced into the regression model, it is not possible to reproduce in the final results the inherent scatter we see in the cross plot. Please note that there is nothing inherently wrong with regression models when applied to appropriate data sets which aren’t spatially dependent; there are many excellent examples of regression modeling in other sciences.
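The two attributes Rich mentions (low values pulled up, high values pulled down, and scatter lost) can be seen in a small numerical sketch. Everything here is an assumption made for illustration: the data are synthetic, the coefficients are arbitrary, and `fit_line` is a hypothetical helper, not something from any package discussed in this conversation.

```python
import random
from statistics import mean, pvariance

def fit_line(x, y):
    """Ordinary least squares of y on x; returns (intercept, slope)."""
    mx, my = mean(x), mean(y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    slope = sxy / sxx
    return my - slope * mx, slope

random.seed(0)
# Hypothetical data: an independent variable x and a noisy dependent variable y.
x = [random.gauss(0.0, 1.0) for _ in range(500)]
y = [2.0 + 0.5 * xi + random.gauss(0.0, 0.5) for xi in x]

a, b = fit_line(x, y)
pred = [a + b * xi for xi in x]

# Sort by observed value and look at the lowest and highest quarters.
pairs = sorted(zip(y, pred))
q = len(pairs) // 4
low_bias = mean([p - o for o, p in pairs[:q]])    # predicted minus observed
high_bias = mean([p - o for o, p in pairs[-q:]])

var_obs = pvariance(y)
var_pred = pvariance(pred)

print("mean (predicted - observed), lowest quarter :", round(low_bias, 3))
print("mean (predicted - observed), highest quarter:", round(high_bias, 3))
print("variance of observed y :", round(var_obs, 3))
print("variance of predicted y:", round(var_pred, 3))
```

In this sketch the lowest observed values come out overestimated on average, the highest come out underestimated, and the predictions have less variance than the data, because the fitted line carries no error term.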
Jeffrey: It matters how you set up the axes in a regression plot! First, by tradition, the dependent variable is always plotted on the Y axis and the independent variable is plotted on the X axis. Further, the “harder” physical variable (when you know it) is always plotted on the X axis. For example, as contrary as this may seem, if you were creating a regression of acoustic impedance with porosity, most folks would plot porosity on the Y axis and AI on the X axis. This would be incorrect! It matters which variable is plotted on which axis because the resulting equations are not the same, and different answers will be produced.
Rich: You are right about setting up the axes correctly, and let me explain why. Traditional regression is a curve-fitting technique that uses an equation to create a “best fit line” through a cloud of points, minimizing the differences between the data points and the fitted line as measured along the Y-axis. The general form of a linear regression model is:
Y (dependent variable) = a (constant) ± b(slope) * X (independent variable)
The sign of the “b or slope” term depends upon whether there is a direct (positive) correlation or an inverse (negative) correlation. If the slope is determined by minimizing the error along the X-axis, then the slope changes and so does the equation; the same thing happens if the roles of the independent and dependent variables are swapped. Let me give you an example: Suppose that our objective is to compute porosity (PHI) from acoustic impedance (AI). Typically PHI is put on the Y-axis and AI on the X-axis, and we get an equation of the form PHI = a - b(AI). However, that set-up assumes that a change in AI causes a change in PHI, which doesn’t make sense physically. So we should reverse the axes to get the correct relationship, AI = a - b(PHI), and then solve for PHI, which gives PHI = (AI - a)/(-b). See Davis, J. C., 2002, Statistics and Data Analysis in Geology, Third Edition, John Wiley & Sons, New York, pages 204-207.
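To make the axis argument concrete, here is a minimal sketch that fits the same cloud of points both ways. The PHI and AI values are synthetic and the coefficients are invented for illustration; `fit_line`, `phi_direct`, and `phi_inverted` are hypothetical names. Note that the code carries the sign inside the slope, so a fitted slope of `b2 < 0` corresponds to the form AI = a - b(PHI) with b = -b2.

```python
import random
from statistics import mean

def fit_line(x, y):
    """Ordinary least squares of y on x; returns (intercept, slope)."""
    mx, my = mean(x), mean(y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    slope = sxy / sxx
    return my - slope * mx, slope

random.seed(1)
# Hypothetical synthetic data: porosity (fraction) drives acoustic impedance.
phi = [random.uniform(0.05, 0.30) for _ in range(300)]
ai = [12000.0 - 20000.0 * p + random.gauss(0.0, 600.0) for p in phi]

# Set-up 1: PHI on the Y-axis, AI on the X-axis (errors minimized in PHI).
a1, b1 = fit_line(ai, phi)

# Set-up 2: AI on the Y-axis, PHI on the X-axis (errors minimized in AI),
# then solve the fitted equation for PHI.
a2, b2 = fit_line(phi, ai)

def phi_direct(ai_value):
    return a1 + b1 * ai_value

def phi_inverted(ai_value):
    return (ai_value - a2) / b2

for ai_value in (6500.0, 9000.0, 10500.0):
    print(ai_value,
          round(phi_direct(ai_value), 4),
          round(phi_inverted(ai_value), 4))
```

Both lines pass through the point of mean AI and mean PHI, so the two porosity estimates agree there and diverge toward the extremes; the direct fit always has the flatter slope in PHI-versus-AI terms.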
Jeffrey: At the risk of getting ahead of myself, the other thing you don’t EVER want to do is use any kind of linear or nonlinear regression to fit a variogram! Variogram modeling is NOT a curve fitting exercise! We see this in some software packages, and it is misleading and substantially incorrect.
Rich: If a regression model were used to “fit” the experimental variogram and then used in kriging, it is very likely that the kriging variances would not satisfy the “positive definite” criterion; that is, the kriging variance must be ≥0, a condition NOT guaranteed by a regression model. Currently there are 14 authorized equations (including the Nugget model) that can be used to model the experimental variogram.
Jeffrey: Regression analysis is a very valuable technique and, used for its designed purpose, can be very powerful. In data analysis, it SHOULD be used to determine the type and strength of the relationship between two or more variables (with appropriate set-up of the axes). It should NOT be used as a spatial estimation conversion technique, nor as a method for modeling the variogram.
Rich: In data analysis we use the cross plot to establish a visual relationship between two attributes and use the regression method to compute and display the correlation coefficient. If we are only after the correlation coefficient, then it doesn’t matter which axis each variable is plotted on, but if we decide to use the regression equation, then it does matter which property is plotted on the Y-axis. A final note related to the correlation coefficient: often it is the R-square value that is plotted on the cross plot and not r, the correlation coefficient. Don’t confuse these values. R-square is the coefficient of determination; it tells how much of the variability in the dependent variable (Y-axis) is explained by its relationship to the independent variable, and it ALWAYS has a positive sign. The correlation coefficient tells us the strength of the relationship and can have a positive or negative sign. The significance of the correlation depends upon the number of data points and can be tested using Student’s t-test (one-tail) to test whether the correlation is different from 0, or no correlation.
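The distinction between r, R-square, and the significance test can be sketched in a few lines. The data are again synthetic and invented for illustration, and `pearson_r` is a hypothetical helper. The sketch assumes a simple (one-variable) linear regression, where R-square equals r squared, and it uses the standard test statistic t = r·sqrt(n - 2)/sqrt(1 - r²) with n - 2 degrees of freedom; the critical value still has to be looked up in a Student’s t table.

```python
import math
import random
from statistics import mean

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    mx, my = mean(x), mean(y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)

random.seed(2)
# Hypothetical synthetic data with an inverse PHI-AI relationship.
phi = [random.uniform(0.05, 0.30) for _ in range(40)]
ai = [12000.0 - 20000.0 * p + random.gauss(0.0, 1500.0) for p in phi]

r = pearson_r(phi, ai)          # keeps the sign of the relationship
r_swapped = pearson_r(ai, phi)  # swapping the axes does not change r
r_squared = r * r               # coefficient of determination, never negative

n = len(phi)
dof = n - 2
t_stat = r * math.sqrt(dof) / math.sqrt(1.0 - r_squared)

print("r         =", round(r, 3))
print("r swapped =", round(r_swapped, 3))
print("R-square  =", round(r_squared, 3))
print("t         =", round(t_stat, 2), "with", dof, "degrees of freedom")
```

The same r with fewer points gives a smaller |t|, which is why the number of data points matters when judging whether a correlation is different from zero.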