Quick Jump to Conditional Simulation

PUBLISHED ON August 2, 2010   Data Analysis, conditional simulation   |   9 Comments »

 

(Jeffrey) Ok, here is a question.  For those of you who know about conditional simulation (and we will blog further on this later), we are considering the relationship between different realizations, each of which is said to be independent and equally probable.  What would happen if we cross plotted two realizations of porosity from the same data set (same parameters, data, spatial model, grid dimensions, etc.), where the only difference is the random seed used to set the random walk?

1. What would this cross plot look like?   

Let’s make it more complicated.  Let’s say I run 100 realizations of porosity from the same data set (as above) but then multiply them by realizations of some other variable.  For example, I have realizations for porosity, gross volume, and net volume (net-to-gross), and I want to calculate net pore volume for each realization.  Let’s say the oil/water contact is flat.  Let’s also assume that the reservoir is composed entirely of sandstones and shales.  We’ll consider “good” reservoir quality rock to be above a designated porosity threshold (if it helps you visualize, say 18%).  The only things that vary are the quality of the porosity (values of porosity in the inter-well space), the value of net-to-gross, and, of course, the random seed for each of the realizations.

 1.  If I cross plot any two realizations of porosity, what does that cross plot look like? 

 2.  If I cross plot any two realizations of gross volume, what does it look like? 

 3.  If I cross plot any two realizations of net volume (net-to-gross), what does it look like? 

 4.  If I cross plot any two realizations of net porosity volume, what does that cross plot look like? 

 5.  Should I use the same random seeds for the realizations of each property?  That is, if I choose a random seed, say 123456, and run 100 realizations of porosity, should I also choose the random seed 123456 for the other variables like net-to-gross?

 The assumption here is that the initiating random seed generates a suite of 99 more random seeds to be used in the 100 realizations of a given variable.  If I choose the same initiating random seed (under the same grid conditions: same size, number of cells, etc.) for a different variable, I would generate the same suite of random walks in the same order.  So, here is the big question:

6.  If I perform some operation between realizations of different variables, say a multiplication of a porosity realization with a net-to-gross realization above some oil/water (or gas/water, etc.) contact, should I be operating on two realizations that have identical random walks?  Or, not?  Does it matter? 

 Let’s see who is thinking?
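The seeding assumption behind questions 5 and 6 can be sketched in a few lines of Python. This is only an illustration: the helper names (child_seeds, random_paths), the 50-cell grid, and the use of Python's random module are our assumptions, and a given modeling package may derive its seeds and visiting order quite differently.

```python
import random

def child_seeds(master_seed, n_realizations):
    """Initiating seed plus (n_realizations - 1) seeds derived from it (assumed scheme)."""
    rng = random.Random(master_seed)
    return [master_seed] + [rng.randrange(2 ** 31) for _ in range(n_realizations - 1)]

def random_paths(master_seed, n_realizations, n_cells):
    """One random visiting order of the grid cells (the random walk) per realization."""
    paths = []
    for seed in child_seeds(master_seed, n_realizations):
        rng = random.Random(seed)
        paths.append(rng.sample(range(n_cells), n_cells))
    return paths

# Same grid size for both variables, as in the question.
porosity_paths = random_paths(123456, n_realizations=100, n_cells=50)
ntg_same_seed = random_paths(123456, n_realizations=100, n_cells=50)
ntg_other_seed = random_paths(654321, n_realizations=100, n_cells=50)

# Same initiating seed: realization k of both variables visits the cells in the same order.
print(porosity_paths == ntg_same_seed)
# Different initiating seed: the suites of random walks differ.
print(porosity_paths == ntg_other_seed)
```

Under this scheme the choice of initiating seed fixes the whole suite of walks, which is exactly why question 6 is worth asking.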

Why not use regression as a spatial estimator?

PUBLISHED ON July 16, 2010   Data Analysis   |   4 Comments »

 

(Jeffrey) We made the comment that regression analysis, while useful in many ways, does not make a good spatial estimator.  A question that always comes up is, why not?  Let me set the stage, and see what kind of response we get…  Let’s say I have a map of acoustic impedance (AI) from seismic data.  More specifically, I have a 2D or 3D geocellular grid of AI.  What I would like to do is create a map of porosity.  My workflow is the following:

 1.  I create AI at my point locations, either as a function of a sonic log or back-calculated from my seismic grid.

 2.  I have measured porosity from logs (or core) at my point locations.

 3.  I cross plot the two variables and perform a regression analysis, resulting in a correlation coefficient of 0.98 and an equation, something like:    Phi = AI(m) - k, where m is the slope and k is a constant.

4.  I apply the formula at each node of the grid converting acoustic impedance into porosity.

 The result is a new map displaying the distribution of porosity.  Would this be a good thing to do???
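For readers who want to try it, the four-step workflow above can be sketched as follows. The well values and the tiny 2 x 3 grid are made up for illustration, and fit_line is our own helper rather than anything from a particular package; the intercept it returns plays the role of -k in the equation of step 3.

```python
def fit_line(x, y):
    """Least-squares fit of y = intercept + slope * x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

# Steps 1 and 2: hypothetical AI and porosity (fraction) at the well locations.
ai_wells = [6000.0, 6500.0, 7000.0, 7500.0, 8000.0, 8500.0]
phi_wells = [0.27, 0.25, 0.22, 0.20, 0.17, 0.15]

# Step 3: regression of porosity on AI.
intercept, slope = fit_line(ai_wells, phi_wells)

# Step 4: apply the equation at every node of a hypothetical AI grid.
ai_grid = [[6200.0, 6900.0, 7400.0],
           [7800.0, 8100.0, 8400.0]]
phi_grid = [[intercept + slope * ai for ai in row] for row in ai_grid]

for row in phi_grid:
    print([round(phi, 3) for phi in row])
```

Notice that the node coordinates never enter the calculation: each porosity value depends only on the AI value at that node.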

About Regression Analysis

PUBLISHED ON December 7, 2009   Data Analysis   |   No Comments »

 

Jeffrey:  Ok, so we’ve established that outliers can be our friends, but they cause many of the statistical metrics to behave oddly.  Thus, we temporarily suppress them to develop an accurate profile of the data and then decide whether we will keep them (good data) or delete/fix them (bad data).  But I want to move on to some details about regression analysis.  There are lots of misunderstandings around regression analysis that often result in mistakes or misuse!

 

Let’s start with some generalities that folks should keep in mind:

 

First, regression is a classical statistical technique that assumes the observations (data) are independent (no relationship to one another).  Earth modeling data, however, is generally dependent (it matters whether you drill a well north, south, east, or west of a producing well), so regression is not, and should not be confused with, a spatial estimation technique!  For example, you would never want to take a map of, say, porosity and transform it to permeability via a regression equation!  I know, that sounds like heresy, but it’s true!  I’ll let my esteemed colleague explain…

 

Rich: Jeffrey, you pointed out one of the big differences between classical statistical assumptions and geostatistics (spatial statistics), and that is whether there is a spatial dependency between measurements. Most of the data types we deal with in the petroleum industry have a coordinate associated with the measured values. Recall that a simple linear regression model has only one weight (the slope term), which is used to “transform or scale” one measurement type into the other. The use of a single weight in a “mapping” exercise makes the assumption of data stationarity, or no trend in the data. Because the stationarity assumption is often not valid for most of our data sets, the estimated values will also show a trend in the estimation error. Another attribute of the regression model is the tendency to overestimate low values and underestimate high values. We should also note that unless some error term is introduced into the regression model, it is not possible to reproduce in the final results the inherent scatter we see in the cross plot. Please note that there is nothing inherently wrong with regression models when applied to appropriate data sets that aren’t spatially dependent; there are many excellent examples of regression modeling in other sciences.
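Two of the points above, the overestimation of low values with underestimation of high values and the loss of scatter, can be checked numerically. The sketch below uses synthetic, non-spatial data pairs that we made up (slope 0.8, noise 0.6, 500 pairs), so treat it as an illustration of the arithmetic rather than of any real data set.

```python
import random

def mean(values):
    return sum(values) / len(values)

def variance(values):
    m = mean(values)
    return sum((v - m) ** 2 for v in values) / len(values)

# Hypothetical, synthetic data pairs with built-in scatter (not real measurements).
rng = random.Random(0)
x = [rng.gauss(0.0, 1.0) for _ in range(500)]
y = [0.8 * xi + rng.gauss(0.0, 0.6) for xi in x]

# Least-squares regression of y on x.
mx, my = mean(x), mean(y)
cov_xy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / len(x)
slope = cov_xy / variance(x)
intercept = my - slope * mx
r = cov_xy / (variance(x) * variance(y)) ** 0.5

fitted = [intercept + slope * xi for xi in x]
errors = [f - yi for f, yi in zip(fitted, y)]   # estimate minus true value

# Smoothing: the fitted values carry only r squared of the variance of y.
print(variance(fitted), r ** 2 * variance(y), variance(y))

# Average error for the lowest and highest quarter of the true values.
order = sorted(range(len(y)), key=lambda i: y[i])
quarter = len(y) // 4
low_bias = mean([errors[i] for i in order[:quarter]])
high_bias = mean([errors[i] for i in order[-quarter:]])
print(low_bias, high_bias)

# Adding a random error term with the residual spread puts the scatter back.
resid_sd = variance(errors) ** 0.5
simulated = [f + rng.gauss(0.0, resid_sd) for f in fitted]
print(variance(simulated))
```

The first print shows that the variance of the regression estimates equals r squared times the variance of the original values, which is why a regression-derived map looks smoother than the data it came from.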

 

Jeffrey: It matters how you set up the axes in a regression plot!  First, by tradition, the dependent variable is always plotted on the Y axis and the independent variable is plotted on the X axis.  Further, the “harder” physical variable (when you know it) is always plotted on the X axis.  For example, as contrary as this may seem, if you were creating a regression of acoustic impedance with porosity, most folks would plot porosity on the Y axis and AI on the X axis.  This would be incorrect!  It matters which variable is plotted on which axis because the resulting equations are not the same, and different answers will be produced.

 

Rich: You are correct about setting up the axes correctly, and let me explain why. Traditional regression is a curve-fitting technique that uses an equation to create a “best fit line” through a cloud of points, minimizing the differences between the data points and the fitted line along the Y-axis. The general form of a linear regression model is:

 

Y (dependent variable) = a (constant) ± b(slope) * X (independent variable)

 

The sign of the “b” (slope) term depends upon whether there is a direct (positive) correlation or an inverse (negative) correlation. If the slope is determined by minimizing the error along the X-axis, then the slope changes and so does the equation; the same thing happens if the roles of the independent and dependent variables are swapped. Let me give you an example: Suppose that our objective is to compute porosity (PHI) from acoustic impedance (AI). Typically PHI is put on the Y-axis and AI on the X-axis, and we get an equation of the form PHI = a - b(AI). However, that set-up assumes that a change in AI causes a change in PHI, which doesn’t make sense. So we should reverse the axes to get the correct relationship, the equation AI = a - b(PHI), and then solve for PHI, which gives PHI = (AI - a)/-b. See Davis, J. C., 2002, Statistics and Data Analysis in Geology, Third Edition, John Wiley & Sons, New York, pages 204-207.
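To see that the two set-ups really do give different answers, here is a small sketch with made-up well pairs. The fit_line helper and the data values are ours; note that the fitted slopes b1 and b2 carry their own (negative) sign, whereas the equations above write the minus sign explicitly.

```python
def fit_line(x, y):
    """Least-squares fit of y = intercept + slope * x (errors minimized along y)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope

# Hypothetical well pairs (illustrative values only).
ai = [6000.0, 6400.0, 6900.0, 7300.0, 7800.0, 8200.0, 8600.0]
phi = [0.28, 0.24, 0.25, 0.19, 0.20, 0.15, 0.14]

# Set-up 1: PHI on the Y-axis, AI on the X-axis.
a1, b1 = fit_line(ai, phi)            # PHI = a1 + b1 * AI

# Set-up 2: AI on the Y-axis, PHI on the X-axis, then solve for PHI.
a2, b2 = fit_line(phi, ai)            # AI = a2 + b2 * PHI

def phi_direct(ai_value):
    return a1 + b1 * ai_value

def phi_inverted(ai_value):
    return (ai_value - a2) / b2

for ai_value in (6000.0, 7300.0, 8600.0):
    print(ai_value, round(phi_direct(ai_value), 4), round(phi_inverted(ai_value), 4))

# The two slopes are linked through the correlation: b1 * b2 equals r squared.
print(b1 * b2)
```

Both lines pass through the point of means, but the inverted line is steeper, so the two equations disagree most at the low and high ends of the AI range. They coincide only when the correlation is perfect.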

 

Jeffrey: At the risk of getting ahead of myself, the other thing you don’t EVER want to do is use any kind of linear or non-linear regression to fit a variogram!  Variogram modeling is NOT a curve fitting exercise!  We see this in some software packages, and it is misleading and substantially incorrect.

 

Rich: If a regression model were used to “fit” the experimental variogram and then used in kriging, it is very likely the kriging variances would not satisfy the “positive definite” criterion; that is, the kriging variance must be ≥0, a condition NOT guaranteed by a regression model. Currently there are 14 authorized equations (including the Nugget model) that can be used to model the experimental variogram.
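A small numerical check shows what can go wrong with a curve that is not positive definite. In the sketch below the exponential model is one of the permissible models, while the “boxcar” curve is a deliberately simple stand-in we invented for an arbitrary fitted curve (it is not what any particular regression fit would produce). The three locations and the weights are also just illustrative choices.

```python
import math

def exponential_cov(h, sill=1.0, a=10.0):
    """Exponential covariance model, a permissible (positive definite) model."""
    return sill * math.exp(-abs(h) / a)

def boxcar_cov(h, sill=1.0, a=10.0):
    """Hypothetical non-permissible curve: full covariance within distance a, zero beyond."""
    return sill if abs(h) <= a else 0.0

def combination_variance(cov, locations, weights):
    """Variance of sum(w_i * Z(x_i)) implied by a covariance function: w' C w."""
    return sum(wi * wj * cov(xi - xj)
               for xi, wi in zip(locations, weights)
               for xj, wj in zip(locations, weights))

# Three locations along a line and one set of weights (illustrative values).
locations = [0.0, 6.0, 12.0]
weights = [1.0, -math.sqrt(2.0), 1.0]

print(combination_variance(exponential_cov, locations, weights))  # stays positive
print(combination_variance(boxcar_cov, locations, weights))       # goes negative
```

A negative variance for a weighted combination of data is physically meaningless, and the same defect is what can produce negative kriging variances. Restricting yourself to the authorized models rules it out.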

 

Jeffrey: Regression analysis is a very valuable technique and, used for its designed purpose, can be very powerful.  In data analysis, it SHOULD be used to determine the type and strength of the relationship between two or more variables (with appropriate set-up of the axes).  It should NOT be used as a spatial estimation conversion technique, nor as a method for modeling the variogram.

 

Rich: In data analysis we use the cross plot to establish a visual relationship between two attributes and use the regression method to compute and display the correlation coefficient. If we are only after the correlation coefficient, then it doesn’t matter on which axis we plot the variables, but if we decide to use the regression equation, then it does matter which property is plotted on the Y-axis. A final note related to the correlation coefficient: often it is the R-square value that is plotted on the cross plot and not r, the correlation coefficient. Don’t confuse these values. R-square is the coefficient of determination and tells how much of the variability in the dependent variable (Y-axis) is explained by its relationship to the independent variable, and it ALWAYS has a positive sign. The correlation coefficient tells us the strength of the relationship and can have a positive or negative sign. The significance of the correlation depends upon the number of data points and can be tested using Student’s t-test (one-tail) to test whether the correlation is different from 0, or no correlation.
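Here is a short sketch of that significance test. The statistic t = r * sqrt((n - 2) / (1 - r^2)) with n - 2 degrees of freedom is the standard form; the r values and sample sizes are examples we picked, and the two critical values are the one-tail 5% entries from a standard Student's t table.

```python
import math

def t_statistic(r, n):
    """t value for testing whether correlation r from n data pairs differs from 0."""
    return r * math.sqrt((n - 2) / (1.0 - r * r))

# One-tail 5% critical values of Student's t, keyed by degrees of freedom (n - 2).
T_CRITICAL_05 = {2: 2.920, 8: 1.860}

for r, n in [(0.80, 10), (-0.80, 10), (0.80, 4)]:
    r_squared = r * r
    t = t_statistic(r, n)
    significant = abs(t) > T_CRITICAL_05[n - 2]
    print(f"r={r:+.2f}  R2={r_squared:.2f}  n={n}  t={t:+.3f}  significant={significant}")
```

With these numbers, r = +0.80 and r = -0.80 share the same R-square, and the same r of 0.80 passes the test with 10 data pairs but not with 4.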

Outliers

PUBLISHED ON November 16, 2009   Data Analysis   |   5 Comments »
Note how statistics and correlation coefficient change when the outlier is present (a and b) and suppressed (c and d)
  
 
 
 
Jeffrey:  I want to continue with this conversation on data analysis and get to a substantial issue around cross plots and regression.  But first, I want to point out a few important details about outliers.

 1.  First, an outlier is not necessarily representative of “bad data.”  It’s simply substantially different from the other data being analyzed.  For example, you could have a single very high porosity value representing, say, fracture porosity in amongst values representing matrix porosity.  The result would be a single “outlier” positioned on the far right of the horizontal axis in a histogram, or at the extremes of one or both axes of a cross plot.  In either case, it would be sitting off all by itself.  While the value may be real, it still throws off the basic statistical metrics (including a regression equation), which are generally sensitive to extreme values (see figure above).  The goal here would be not to “delete” that point, but to temporarily suppress it to better characterize the statistical profile of the bulk of the data you are analyzing.  As you suggest in your comments, one would see a bimodal or polymodal distribution if there were more than a single extreme value, in which case you would isolate the various modal families and treat them independently.

2.  Also, just to be clear, it is always best to report the type of correlation value you are using to describe a cross plot.  There is a big difference between reporting “r,” the correlation coefficient, and “R2,” the coefficient of determination.  The former reports the strength of the relationship while also identifying whether it is direct or inverse, and the latter reports the percentage of the total variance explained by the predictor (independent) variable.   Any thoughts here?

Rich: Your first point is correct: the “outlier” is not necessarily an invalid piece of information. It is based on a statistical definition in which the value’s difference from the mean exceeds some threshold of 2 to 2.5 standard deviations. If we surmise that the very high porosity is representative of fracture porosity, then we should isolate it as a subset and compute the statistics with the high value suppressed. If this is the only high value in the data set and we have convinced ourselves that the measurement is valid, then we can also use it when we generate our map, and the high value will express itself as a local anomaly at its XY(Z) location.
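As a rough illustration of that definition, the sketch below flags values more than 2.5 standard deviations from the mean and then profiles the data with the flagged value suppressed but not deleted. The porosity values are made up, with the last one standing in for a fracture-porosity reading, and flag_outliers is our own helper name.

```python
import statistics

# Hypothetical porosity values (fractions); the last one stands in for fracture porosity.
porosity = [0.12, 0.14, 0.15, 0.13, 0.16, 0.14, 0.15,
            0.13, 0.17, 0.14, 0.16, 0.15, 0.45]

def flag_outliers(values, threshold=2.5):
    """Flag values lying more than `threshold` standard deviations from the mean."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [abs(v - mean) / sd > threshold for v in values]

flags = flag_outliers(porosity)
suppressed = [v for v, is_outlier in zip(porosity, flags) if not is_outlier]

# The flagged value stays in `porosity`; it is only left out of the second profile.
print("all data:   mean=%.3f sd=%.3f" % (statistics.mean(porosity), statistics.stdev(porosity)))
print("suppressed: mean=%.3f sd=%.3f" % (statistics.mean(suppressed), statistics.stdev(suppressed)))
```

Keeping the original list intact means the high value is still available later, for example when generating the map.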

 Regarding your second point (2), there is much confusion about the meaning of both “r,” the correlation coefficient, and “R2,” the coefficient of determination.  So let me add a little more detail…  As you state, the correlation coefficient (r) informs us about the strength of the relationship between two properties and ranges between -1 and 1, where -1 is a perfect inverse relationship and 1 is a perfect direct relationship. Note that the correlation coefficient can be a positive or negative value, but this is not the case for R2, which is always positive for obvious reasons (it is the square of the correlation coefficient). Let’s say the correlation between two properties is -0.80; the coefficient of determination (R2) is then 0.64, which is the same value as when the correlation coefficient is 0.80. In both cases only 64% of the variability in the dependent variable is explained by its relationship to the independent variable. If only the R2 value is reported without an accompanying cross-plot, then we have no idea whether the properties relate directly or have an inverse relation. Just a side note: traditionally “r” is the symbol for the correlation coefficient, whereas “R2” is the coefficient of determination.  Unfortunately, in our business folks are not very good at using standard nomenclature.  This is probably what leads to some of the confusion we experience when reviewing reservoir studies.
 

Earth Modeling; Thoughts from Rich and Jeffrey about Data Analysis

PUBLISHED ON November 6, 2009   Data Analysis   |   No Comments »

Blogging is a new experience for us, but it seems a good way to engage the Earth Modeling community in an open dialogue. I’d like to start blogging with the topic of data analysis.

Jeffrey: Over the years, Rich and I have seen a lot of projects and have had hundreds (if not a few thousand) of earth modelers come through our classes on geostatistics and earth modeling. Inevitably, the most common problems we see are around data analysis. In our classes we always ask what percentage of time is spent cleaning up the data and establishing the basic relationships between variables. The answer is consistently between 50% and 75%. That means that half to three-quarters of the project time is spent on this activity. That’s a huge effort and, unfortunately, often not accounted for in project design. This leads to increasing stress levels in the asset team as well as an increase in mistakes. Good data analysis takes time and some background in univariate and multivariate statistics! Yep, that means some math, and so many people shy away from it. But the results can help avoid outliers (values that seem out of range compared to what you might expect) and quantify the important relationships between variables that will be used in the modeling process. Examples could be porosity and permeability, or acoustic impedance and porosity. What do you think, Rich?

Rich: Jeffrey, you touch on a very interesting point about data analysis not being given sufficient time in the project planning process. Regardless of what we think, our databases aren’t as “clean” as we think. Some fundamental displays are required; for example, simple frequency histograms reveal a lot about our data distributions. We can answer basic questions about the min, max, mode, etc. of the data. Is there a single mode or are there multiple modes? If there are multiple modes, we need to ask why. Could these be different data populations related to different rock types, matrix versus fracture porosity, or maybe just a lack of samples? Histograms could also point to potential outliers, especially if we see only a few data points in a bin creating highly skewed distributions; these apparent outliers will create problems for us later.

Cross-plots provide us with four basic pieces of information: 1) how one data type compares to another data type when measurements are made at the same locations (we need data pairs to make cross-plots); 2) whether the two data types are related directly (positive correlation) or inversely (negative correlation); 3) the correlation value, which tells us the strength of the data relationship; 4) finally, an equation which allows us to predict one data type from the other.