Note: This page has been translated by MathWorks. Click here to see

To view all translated materials including this page, select Country from the country navigator on the bottom of this page.

To view all translated materials including this page, select Country from the country navigator on the bottom of this page.

Density-based spatial clustering of applications with noise (DBSCAN)

`idx = dbscan(X,epsilon,minpts)`

`idx = dbscan(X,epsilon,minpts,Name,Value)`

`idx = dbscan(D,epsilon,minpts,'Distance','precomputed')`

`[idx,corepts] = dbscan(___)`

partitions observations in the `idx`

= dbscan(`X`

,`epsilon`

,`minpts`

)*n*-by-*p* data matrix
`X`

into clusters using the DBSCAN algorithm (see Algorithms).
`dbscan`

clusters the observations (or points) based on a threshold
for a neighborhood search radius `epsilon`

and a minimum number of
neighbors `minpts`

required to identify a core point. The function
returns an *n*-by-1 vector (`idx`

) containing cluster
indices of each observation.

returns a vector of cluster indices for the precomputed pairwise distances
`idx`

= dbscan(`D`

,`epsilon`

,`minpts`

,`'Distance'`

,'precomputed')`D`

between observations. `D`

can be the output of
`pdist`

or `pdist2`

, or a more general dissimilarity vector or matrix conforming to the
output format of `pdist`

or `pdist2`

,
respectively.

For improved speed when iterating over many values of

`epsilon`

, consider passing in`D`

as the input to`dbscan`

. This approach prevents the function from having to compute the distances at every point of the iteration.If you use

`pdist2`

to precompute`D`

, do not specify the`'Smallest'`

or`'Largest'`

name-value pair arguments of`pdist2`

to select or sort columns of`D`

. Selecting fewer than*n*distances results in an error, because`dbscan`

expects`D`

to be a square matrix. Sorting the distances in each column of`D`

leads to a loss in the interpretation of`D`

and can give meaningless results when used in the`dbscan`

function.For efficient memory usage, consider passing in

`D`

as a logical matrix rather than a numeric matrix to`dbscan`

when`D`

is large. By default, MATLAB^{®}stores each value in a numeric matrix using 8 bytes (64 bits), and each value in a logical matrix using 1 byte (8 bits).To select a value for

`minpts`

, consider a value greater than or equal to the number of dimensions of the input data plus one [1]. For example, for an*n*-by-*p*matrix`X`

, set`'minpts'`

equal to*p*+1 or greater.One possible strategy for selecting a value for

`epsilon`

is to generate a*k*-distance graph for`X`

. For each point in`X`

, find the distance to the*k*th nearest point, and plot sorted points against this distance. Generally, the graph contains a knee. The distance that corresponds to the knee is typically a good choice for`epsilon`

, because it is the region where points start tailing off into outlier (noise) territory [1].

DBSCAN is a density-based clustering algorithm that is designed to discover clusters and noise in data. The algorithm identifies three kinds of points: core points, border points, and noise points [1]. For specified values of

`epsilon`

and`minpts`

, the`dbscan`

function implements the algorithm as follows:From the input data set

`X`

, select the first unlabeled observation*x*as the current point, and initialize the first cluster label_{1}*C*to 1.Find the set of points within the epsilon neighborhood

`epsilon`

of the current point. These points are the neighbors.If the number of neighbors is less than

`minpts`

, then label the current point as a noise point (or an outlier). Go to step 4.### Note

`dbscan`

can reassign noise points to clusters if the noise points later satisfy the constraints set by`epsilon`

and`minpts`

from some other point in`X`

. This process of reassigning points happens for border points of a cluster.Otherwise, label the current point as a core point belonging to cluster

*C*.

Iterate over each neighbor (new current point) and repeat step 2 until no new neighbors are found that can be labeled as belonging to the current cluster

*C*.Select the next unlabeled point in

`X`

as the current point, and increase the cluster count by 1.Repeat steps 2–4 until all points in

`X`

are labeled.

If two clusters have varying densities and are close to each other, that is, the distance between two border points (one from each cluster) is less than

`epsilon`

, then`dbscan`

can merge the two clusters into one.Every valid cluster might not contain at least

`minpts`

observations. For example,`dbscan`

can identify a border point belonging to two clusters that are close to each other. In such a situation, the algorithm assigns the border point to the first discovered cluster. As a result, the second cluster is still a valid cluster, but it can have fewer than`minpts`

observations.

[1] Ester, M., H.-P. Kriegel, J.
Sander, and X. Xiaowei. “A density-based algorithm for discovering clusters in large
spatial databases with noise.” In *Proceedings of the Second International
Conference on Knowledge Discovery in Databases and Data Mining*, 226-231.
Portland, OR: AAAI Press, 1996.