Fitting generalized pareto distribution to data with zeros and censoring

Question

dandan am 24 Jan. 2018

1
Verknüpfen

Direkter Link zu dieser Frage

https://de.mathworks.com/matlabcentral/answers/378664-fitting-generalized-pareto-distribution-to-data-with-zeros-and-censoring

I am trying to understand the differences between two approaches to fitting a GP distribution to my data, and which one is most appropriate for data that has some zero values and some censored values. I'm attaching a .mat file with two datasets, x1 and x2. An example script is below, but I've had to add a few hacks (%Cut data... section at bottom) to deal with a range of errors that I would really appreciate help with understanding and fixing. These include:

Error using fitdist: The data in X must be greater than the threshold parameter : (if I don't cut/shift the x=0 data) fitdist (or alternately, gpfit) seems like it can only work with data greater than 0 (which it assumes is the threshold parameter theta). I also don't understand why this happens since gpfit (which seems to work the same as fitdist) says it does not estimate theta, which is assumed to be known and already subtracted from x--but if you were to subtract the minimum x-value, then you would turn all x=xmin into x=0. Either way, I'm not sure what to do about this since x1 has a bunch of x=0 data, which I don't think I should just be deleting or shifting to an arbitrarily higher value (e.g., x(x==0)=[]; or x(x==0)=0.001;)--although I have tried both to get past this error and see what else is causing problems, including....
Error using fitdist: Censoring not allowed with the generalized pareto distribution : (if I don't cut the x=>maxdist data and try to use the 'censoring' option that's commented out at the bottom of the script) this one all I can say is WTF why not?? I can find nothing in the documentation that mentions or explains this. Is there any way (aside from the MLE used in Method 1) to make Matlab fit a generalized pareto distribution with censoring?
Method 1 seems to be producing a censored CDF (doesn't reach 1) even for dataset x1, which doesn't have any values at/past its censoring point. This happens both for the cut data (weirdly censored) and for the uncut data (although this appears to be because the zero values force it to parameterize values of k=sigma=1 and theta=0).

So it appears that the MLE (Method 1) approach works well for Dataset 2 which actually has censored values, but weirdly cuts off the CDF on Dataset 1 either because of censoring (when it shouldn't have any effects from censoring) or because of the zero values, while the fitdist (Method 2) approach works well for Dataset 1 since it doesn't need to be censored, but can't be used for Dataset 2 which does need censoring. Any help anyone can offer with figuring out an approach that will work on both of these types of (uncut but censored) data would be much appreciated. Thanks!!

clear
%Load data and set maximum x value
load('testdata.mat','x1','x2')
maxdist=40;
xnew=0:0.01:50;
%Calculate PDF and CDF of data and fits
[x1cut,PDF1_mle,PDFc1_mle,PDFc1_fitgp,CDF1_mle,CDFc1_mle,CDFc1_fitgp]=fitgpparams(x1,maxdist,xnew);
[x2cut,PDF2_mle,PDFc2_mle,PDFc2_fitgp,CDF2_mle,CDFc2_mle,CDFc2_fitgp]=fitgpparams(x2,maxdist,xnew);
[CDF1,X1]=ecdf(x1,'censoring',(x1>=maxdist));
[CDF2,X2]=ecdf(x2,'censoring',(x2>=maxdist));
[CDFc1,Xc1]=ecdf(x1cut);
[CDFc2,Xc2]=ecdf(x2cut);
%Plot everything
legendtext={'Data';'Cut Data (x=0,x=maxdist removed)';'MLE (censored, Data)';'MLE (censored, Cut Data)';'Fitdist (uncensored, Cut Data)'};
figure(1)
clf
subplot(2,2,1)
hold on
stairs(X1,CDF1,'k')
stairs(Xc1,CDFc1,'b-.')
plot(xnew,CDF1_mle,'r',xnew,CDFc1_mle,'r--',xnew,CDFc1_fitgp,'g--')
xlim([0 1]) 
ylabel('Cumulative Probability')
title('Dataset 1 (zero values)')
legend(legendtext)
subplot(2,2,2)
hold on
stairs(X2,CDF2,'k')
stairs(Xc2,CDFc2,'b-.')
plot(xnew,CDF2_mle,'r',xnew,CDFc2_mle,'r--',xnew,CDFc2_fitgp,'g--')
xlim([0 50]) 
title('Dataset 2 (censored)')
subplot(2,2,3)
hold on  
histogram(x1,20,'Normalization','pdf');
plot(xnew,PDF1_mle,'r',xnew,PDFc1_mle,'r--',xnew,PDFc1_fitgp,'g--')
xlim([0 1]) 
ylabel('Probability Density')
xlabel('x')
subplot(2,2,4)
hold on  
histogram(x2,50,'Normalization','pdf');
plot(xnew,PDF2_mle,'r',xnew,PDFc2_mle,'r--',xnew,PDFc2_fitgp,'g--')
xlim([0 50]) 
xlabel('x')
legend(legendtext([1 3 4 5]))
function [xdata,PDF_mle,PDFc_mle,PDFc_fitgp,CDF_mle,CDFc_mle,CDFc_fitgp]=fitgpparams(xdata,maxdist,xnew)
    %Anonymous functions to call pdf and cdf for mle
    pdfgp=@(x,k,sigma,theta) gppdf(x,k,sigma,theta);
    cdfgp=@(x,k,sigma,theta) gpcdf(x,k,sigma,theta);
    %METHOD 1: MAX LIKELIHOOD ESTIMATE FOR GENERALIZED PARETO
    %input parameters for mle
    startgp=[1,1,0];
    lowergp=[0,0,0];
    uppergp=[100,100,min(xdata)];
    options = statset('MaxIter',300, 'MaxFunEvals',600); % increase iterations so GP converges
    [param_mle]=mle(xdata, 'pdf',pdfgp,'cdf',cdfgp, 'start',startgp, 'lower',lowergp, 'upper',uppergp, 'censoring',xdata>=maxdist, 'options',options);
    pd_mle=makedist('GeneralizedPareto',param_mle(1),param_mle(2),param_mle(3));
    PDF_mle=pdf(pd_mle,xnew);
    CDF_mle=cdf(pd_mle,xnew);
    %Cut/shift data because fitgp can't censor and neither mle nor fitgp can deal with x=0
    xdata(xdata==0)=0.001; %shift all x=0 values to x=0.001
    xdata(xdata>=maxdist)=[]; %remove all data past the censoring point
    uppergp=[100,100,min(xdata)];
    [paramc_mle]=mle(xdata, 'pdf',pdfgp,'cdf',cdfgp, 'start',startgp, 'lower',lowergp, 'upper',uppergp, 'censoring',xdata>=maxdist, 'options',options);
    pdc_mle=makedist('GeneralizedPareto',paramc_mle(1),paramc_mle(2),paramc_mle(3));
    PDFc_mle=pdf(pdc_mle,xnew);
    CDFc_mle=cdf(pdc_mle,xnew);
    %METHOD 2: FITDIST (OR GPFIT SEEMS TO DO THE SAME THING)
    pdc_fitgp=fitdist(xdata,'GeneralizedPareto');%,'censoring',xdata>=maxdist); %CENSORING OPTION DOESN'T WORK
    PDFc_fitgp=pdf(pdc_fitgp,xnew);
    CDFc_fitgp=cdf(pdc_fitgp,xnew);
end