Density-based spatial clustering of applications with noise (DBSCAN)

`idx = dbscan(X,epsilon,minpts)`

`idx = dbscan(X,epsilon,minpts,Name,Value)`

`idx = dbscan(D,epsilon,minpts,'Distance','precomputed')`

`[idx,corepts] = dbscan(___)`

partitions observations in the `idx`

= dbscan(`X`

,`epsilon`

,`minpts`

)*n*-by-*p* data matrix
`X`

into clusters using the DBSCAN algorithm (see Algorithms).
`dbscan`

clusters the observations (or points) based on a threshold
for a neighborhood search radius `epsilon`

and a minimum number of
neighbors `minpts`

required to identify a core point. The function
returns an *n*-by-1 vector (`idx`

) containing cluster
indices of each observation.

returns a vector of cluster indices for the precomputed pairwise distances
`idx`

= dbscan(`D`

,`epsilon`

,`minpts`

,`'Distance'`

,'precomputed')`D`

between observations. `D`

can be the output of
`pdist`

or `pdist2`

, or a more general dissimilarity vector or matrix conforming to the
output format of `pdist`

or `pdist2`

,
respectively.

For improved speed when iterating over many values of

`epsilon`

, consider passing in`D`

as the input to`dbscan`

. This approach prevents the function from having to compute the distances at every point of the iteration.If you use

`pdist2`

to precompute`D`

, do not specify the`'Smallest'`

or`'Largest'`

name-value pair arguments of`pdist2`

to select or sort columns of`D`

. Selecting fewer than*n*distances results in an error, because`dbscan`

expects`D`

to be a square matrix. Sorting the distances in each column of`D`

leads to a loss in the interpretation of`D`

and can give meaningless results when used in the`dbscan`

function.For efficient memory usage, consider passing in

`D`

as a logical matrix rather than a numeric matrix to`dbscan`

when`D`

is large. By default, MATLAB^{®}stores each value in a numeric matrix using 8 bytes (64 bits), and each value in a logical matrix using 1 byte (8 bits).To select a value for

`minpts`

, consider a value greater than or equal to the number of dimensions of the input data plus one [1]. For example, for an*n*-by-*p*matrix`X`

, set`'minpts'`

equal to*p*+1 or greater.One possible strategy for selecting a value for

`epsilon`

is to generate a*k*-distance graph for`X`

. For each point in`X`

, find the distance to the*k*th nearest point, and plot sorted points against this distance. Generally, the graph contains a knee. The distance that corresponds to the knee is typically a good choice for`epsilon`

, because it is the region where points start tailing off into outlier (noise) territory [1].

DBSCAN is a density-based clustering algorithm that is designed to discover clusters and noise in data. The algorithm identifies three kinds of points: core points, border points, and noise points [1]. For specified values of

`epsilon`

and`minpts`

, the`dbscan`

function implements the algorithm as follows:From the input data set

`X`

, select the first unlabeled observation*x*as the current point, and initialize the first cluster label_{1}*C*to 1.Find the set of points within the epsilon neighborhood

`epsilon`

of the current point. These points are the neighbors.If the number of neighbors is less than

`minpts`

, then label the current point as a noise point (or an outlier). Go to step 4.### Note

`dbscan`

can reassign noise points to clusters if the noise points later satisfy the constraints set by`epsilon`

and`minpts`

from some other point in`X`

. This process of reassigning points happens for border points of a cluster.Otherwise, label the current point as a core point belonging to cluster

*C*.

Iterate over each neighbor (new current point) and repeat step 2 until no new neighbors are found that can be labeled as belonging to the current cluster

*C*.Select the next unlabeled point in

`X`

as the current point, and increase the cluster count by 1.Repeat steps 2–4 until all points in

`X`

are labeled.

If two clusters have varying densities and are close to each other, that is, the distance between two border points (one from each cluster) is less than

`epsilon`

, then`dbscan`

can merge the two clusters into one.Every valid cluster might not contain at least

`minpts`

observations. For example,`dbscan`

can identify a border point belonging to two clusters that are close to each other. In such a situation, the algorithm assigns the border point to the first discovered cluster. As a result, the second cluster is still a valid cluster, but it can have fewer than`minpts`

observations.

[1] Ester, M., H.-P. Kriegel, J.
Sander, and X. Xiaowei. “A density-based algorithm for discovering clusters in large
spatial databases with noise.” In *Proceedings of the Second International
Conference on Knowledge Discovery in Databases and Data Mining*, 226-231.
Portland, OR: AAAI Press, 1996.