1 Introduction
Many distance-based algorithms have a bias towards dense clusters ^{2, 5, 13, 36, 26}. For example, (i) DBSCAN ^{12} is biased towards grouping neighbouring dense clusters into a single cluster in the presence of a sparse cluster ^{36}. (ii) The kNN anomaly detector ^{4, 27} employs the kNN density estimator ^{16, 23} to estimate the density of every point in a dataset, and then sorts all the points by their estimated density in descending order. Low-density points are regarded as more likely to be anomalous than high-density points. Since the ranking is based on global densities, this bias results in the misclassification of so-called “local anomalies” (points considered anomalous with respect to their neighbourhood) in high-density regions as normal points ^{7, 8, 9}. A number of techniques have been proposed to “correct” this bias, notably, the density-ratio approach of ReCon ^{36}, ReScale ^{36} and DScale ^{37} (for density-based clustering such as DBSCAN ^{14} and DP ^{29}), and the reachability-distance approach of LOF ^{7} (for the kNN anomaly detector ^{27}).
The density-ratio approach ^{36, 37} aims to transform the data in such a way that the estimated density of each transformed point approximates the density-ratio of that point in the original space; as a result, all locally low-density points are easily separated from all locally high-density points using a single global threshold.
While the current density-ratio methods have been shown to improve the performance of density-based clustering, we identify their shortcomings and propose a new algorithm that better achieves the aim of the density-ratio approach. Furthermore, we extend its application to distance-based anomaly detection.
We propose to perform a multi-dimensional Cumulative Distribution Function (CDF) based transformation on the given dataset as a preprocessing step, which learns a mapping^{1} such that the resulting Euclidean distance between pairs of points is “locally consistent” with the Euclidean distance between points in the original space (i.e., all local inter-point distances are proportional to their original values), while the final transformed dataset has a uniform distribution, as estimated by a density estimator using a certain bandwidth. (^{1}The mapping depends on all points in the given dataset, where each point asserts a local, point-based scaling factor.) This effectively achieves the desired aim of the CDF transform, i.e., density equalisation w.r.t. a density estimator with a certain bandwidth, without impairing the cluster structure of the original dataset.

This paper makes the following contributions:

It generalises the existing CDF scaling method DScale ^{37} so that it can be applied to density estimators using uniform kernels with either fixed or variable bandwidths.

It proposes a new Transform-Shift method called CDFTS (CDF Transform-Shift) that changes the volumes of each point’s neighbourhood simultaneously in the full multi-dimensional space. Existing CDF transform methods are either individual-attribute based (e.g., ReScale ^{36}) or individual-point based (e.g., DScale ^{37}). As far as we know, this is the first attempt to perform a multi-dimensional CDF transform to achieve the desired effect of homogenising (making uniform) the distribution of an entire dataset w.r.t. a density estimator.

It applies the new Transform-Shift method to density-based clustering and distance-based anomaly detection to demonstrate its impact in overcoming the weaknesses of three existing algorithms.
The proposed approach CDFTS differs from existing approaches in terms of:

Methodology: While ReScale ^{36}, DScale ^{37} and the proposed CDFTS follow the same principled approach and have the same aim, the proposed CDF Transform-Shift method is a multi-dimensional technique which incorporates both transformation and point-shift, which greatly expands the approach’s applicability. In contrast, ReScale is a one-dimensional transformation and point-shift technique; and DScale incorporates transformation without point-shift. The ‘complete’ CDFTS method leads to better results in both clustering and anomaly detection, as we show in Section 6.

Metric compliance: CDFTS and ReScale create a similarity matrix which is a metric, whereas the one produced by DScale ^{37} is not: it satisfies neither the symmetry property nor the triangle inequality.

Ease of use: Like ReScale, CDFTS transforms the data as a preprocessing step. The targeted (clustering or anomaly detection) algorithm can then be applied unaltered to the transformed data. In contrast, DScale requires an algorithmic modification in order to use it.
To demonstrate the effects of different dissimilarity measures on image segmentation, we took the image shown in Figure 0(a), applied different rescaling methods (no rescaling, ReScale, DScale and CDFTS), and clustered the 3-dimensional pixel values of the image in LAB space ^{32}. Table 1 shows the results of applying the clustering algorithm DP ^{29} to produce three clusters. The scatter plots in LAB space in Figure 1 show that, with CDFTS, the three clusters become more uniformly distributed with similar density, and the gaps between their boundaries are much larger than with the other three scaling methods. As a result, DP with CDFTS yields the best clustering result, as shown in Table 1.
Table 1: Segmentation results (Cluster 1, Cluster 2, Cluster 3) of DP under No Scaling, ReScale, DScale and CDFTS. [Cluster images omitted.]
The rest of the paper is organised as follows. We describe issues of inhomogeneous density in density-based clustering and distance-based anomaly detection in Section 2. Section 3 presents density-ratio estimation as a principle for addressing the issues of inhomogeneous density. Section 4 presents the two existing CDF scaling approaches based on density-ratio. Section 5 proposes the local-density transform-shift as a multi-dimensional CDF scaling method. Section 6 empirically evaluates the performance of existing density-based clustering and distance-based anomaly detection algorithms on the transformed-and-shifted datasets. Discussion and conclusions are provided in the last two sections.
2 Issues of inhomogeneous density
2.1 Density-based clustering
Let , denote a dataset of points, each sampled independently from a distribution . Let denote the density estimate of point which approximates the true density . In addition, let be the neighbourhood of , , where is the distance function ().
In general, the density of a point can be estimated via a small neighbourhood (as used by the classic density-based clustering algorithm DBSCAN ^{14}) defined as follows:
(1) 
where is the volume of a dimensional ball of radius .
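The neighbourhood density estimate of Equation 1 can be sketched in a few lines of Python. This is an illustrative brute-force implementation only; the function names (`ball_volume`, `eps_density`) are our own choices, not part of any reference implementation.

```python
import math

def ball_volume(radius, dim):
    """Volume of a dim-dimensional ball of the given radius."""
    return (math.pi ** (dim / 2) / math.gamma(dim / 2 + 1)) * radius ** dim

def eps_density(x, data, eps):
    """Uniform-kernel density estimate (Eq. 1): the number of points
    within distance eps of x, divided by n times the eps-ball volume."""
    count = sum(1 for y in data if math.dist(x, y) <= eps)
    return count / (len(data) * ball_volume(eps, len(x)))
```

A point in a dense region then receives a larger estimate than one in a sparse region, which is exactly what the core-point test in Definition 1 below thresholds.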
A set of clusters is defined as non-empty and non-intersecting subsets: . Let denote the mode (the point of highest estimated density) of cluster ; and denote the corresponding peak density value.
DBSCAN uses a global density threshold to identify core points (which have densities higher than the threshold); and then it links neighbouring core points together to form clusters ^{14}. It is defined as follows.
Definition 1.
A core point is a point with an estimated density above or equal to a user-specified threshold , i.e., , where denotes a set indicator function.
Definition 2.
Using a density estimator with density threshold , a point is density connected with another point in a sequence of unique points from , i.e., : is defined as:
Definition 3.
A cluster detected by the densitybased algorithm DBSCAN is a maximal set of density connected instances, i.e., , where is the cluster mode.
For this kind of algorithm to find all clusters in a dataset, the data distribution must satisfy the following necessary condition: the peak density of each cluster must be greater than the maximum, over all possible paths linking any two modes, of the minimum density along the path.^{2} (^{2}A path linking two modes is defined as a sequence of unique points starting with one mode and ending with the other, where adjacent points lie in each other’s neighbourhood.) This condition is formally described by Zhu et al ^{36} as follows:
(2) 
where is the largest of the minimum densities along any path linking the modes of clusters and .
This condition implies that there must exist a threshold that can be used to break all paths between the modes by assigning regions with density less than the threshold to noise. Otherwise, if the mode of some cluster has a density lower than that of a low-density region between other clusters, then this kind of density-based clustering algorithm will fail to find all clusters: either some high-density clusters will be merged together (when a lower density threshold is used), or low-density clusters will be designated as noise (when a higher density threshold is used). To illustrate, Figure 1(a) shows that using a high threshold will cause all points in one cluster to be assigned to noise, but using a low threshold will cause points in two other clusters to be assigned to the same cluster.
(a) A mixture of three Gaussian distributions that cannot be separated using a single density threshold; (b) density distribution of the ReScaled data of (a), where a single density threshold can be found that separates all three clusters. Note that the labelled points are shifted to new positions, and the second bandwidth is larger than the first.

2.2 kNN anomaly detection
A classic nearest-neighbour based anomaly detection algorithm assigns an anomaly score to an instance based on its distance to the kth nearest neighbour ^{27}. The instances with the largest anomaly scores are identified as anomalies.
Given a dataset and the parameter k, the density of a point can be estimated using a kth nearest neighbour density estimator (as used by the classic kNN anomaly detector ^{27}):
(3) 
where is the distance between the point and its kth nearest neighbour in the dataset.
Note that the kth nearest neighbour distance is a proxy for the density of a point: a high kNN distance indicates low density, and vice versa.
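The scoring scheme just described can be sketched as follows; this is a minimal brute-force illustration of the classic kNN detector (quadratic in the dataset size), with hypothetical function names, not an optimised implementation.

```python
import math

def knn_distance(x, data, k):
    """Distance from x to its k-th nearest neighbour (x itself excluded)."""
    dists = sorted(math.dist(x, y) for y in data if y is not x)
    return dists[k - 1]

def knn_anomaly_scores(data, k):
    """Classic kNN anomaly scores: each point's k-th NN distance.
    The largest scores (lowest implied densities) are flagged as anomalies."""
    return [knn_distance(x, data, k) for x in data]
```

Since the score is a global density proxy, a point in a sparse region always outranks a locally anomalous point inside a dense region, which is exactly the bias discussed below.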
Let be the set of all normal points in a dataset . The condition under which the classic kNN anomaly detector could, with an appropriate setting of a density/distance threshold, identify every anomaly in the dataset is given as follows:
(4) 
Equation 4 states that all anomalies must have the highest kNN distances (or lowest densities) in order to be detected. In other words, kNN anomaly detectors can detect both global anomalies and scattered anomalies, which have lower densities than all normal points ^{1, 20}.
However, based on this characteristic, kNN anomaly detectors are unable to detect:

Local anomalies. This is because a local anomaly with low density relative to nearby normal (non-anomalous) instances in a region of high average density may still have higher density than normal instances in regions of lower average density ^{7}. Translating this in terms of kth NN distance: some local anomalies (for example, points located around the boundaries of the dense clusters) are ranked lower than the normal points located around the sparse cluster in the centre, as shown in Figure 1(a).

Anomalous clusters (sometimes referred to as clustered anomalies ^{21}) are groups of points that are too small to be considered “true clusters” and are found in low-density parts of the space (i.e., they are separate from the other clusters). A purported benefit of the kNN anomaly detector is that it is able to remove such anomalous clusters providing k is sufficiently large (larger than the size of the anomalous group) ^{1}. By rescaling the data we are able to extend this property to locally anomalous clusters, i.e., to be tolerant to variations in the local (background or average) density over the space. An example is shown in Figure 3.
2.3 Summary
Both density-based clustering and distance-based anomaly detection have weaknesses when it comes to handling datasets with clusters of inhomogeneous density. Rather than creating a new density estimator or modifying existing clustering and anomaly detection procedures, we advocate transforming the data to be more uniformly distributed than it is in the original space, so that the separation between clusters can be identified easily.
3 Density-ratio estimation
Density-ratio estimation is a principled approach to overcome the weakness of density-based clustering in detecting clusters with inhomogeneous densities ^{36}.
The density-ratio of a point is the ratio of two density estimates calculated using the same density estimator but with two different bandwidth settings.
Let and be density estimators using kernels of bandwidth and , respectively. Given the constraint that the denominator uses a larger bandwidth than the numerator , the density-ratio of is estimated as
(5) 
We correct the lemma from ^{36} regarding the density-ratio value :
Lemma 1.
For any data distribution and sufficiently small values of and s.t. , if is at a local maximum density of , then ; and if is at a local minimum density of , then .
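Lemma 1 can be illustrated numerically. In the hedged sketch below (our own construction, not taken from the paper), a dense and a sparse cluster share the same shape; their raw small-bandwidth densities differ by a large factor, yet the density-ratio at both modes exceeds 1, so one global ratio threshold suffices.

```python
def density(x, data, h):
    """1-D uniform-kernel density estimate with bandwidth h."""
    count = sum(1 for y in data if abs(x - y) <= h)
    return count / (len(data) * 2 * h)

def density_ratio(x, data, h_small, h_large):
    """Ratio of a small- to a large-bandwidth estimate (h_small < h_large)."""
    return density(x, data, h_small) / density(x, data, h_large)

def cluster(centre, scale):
    """A peaked 1-D cluster: spacing grows away from the centre."""
    offsets = [0.2, 0.45, 0.75, 1.1, 1.5]
    return ([centre] + [centre + scale * o for o in offsets]
                     + [centre - scale * o for o in offsets])

data = cluster(0.0, 0.1) + cluster(10.0, 1.0)   # dense + sparse cluster

# Raw densities at the two modes differ by a large factor...
d_dense, d_sparse = density(0.0, data, 0.3), density(10.0, data, 0.3)
# ...but both density-ratios exceed 1, so a single global threshold works.
r_dense = density_ratio(0.0, data, 0.3, 2.0)
r_sparse = density_ratio(10.0, data, 0.3, 2.0)
```

Points in the valley between the clusters, by contrast, have a ratio well below 1, which is what lets a threshold around unity break the paths between modes.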
Since points located in locally high-density areas (almost invariably) have a density-ratio higher than points located in locally low-density areas, a global density-ratio threshold around unity can be used to identify all cluster peaks and break all paths between different clusters. Thus, based on density-ratio estimation, existing density-based clustering algorithms such as DBSCAN can identify clusters as regions of locally high density, separated by regions of locally low density.
Similarly, a density-based anomaly detector is able to detect local anomalies, since their density-ratio values are lower than (and hence ranked as more anomalous than) those of normal points with locally high densities.
4 CDF scaling to overcome the issue of inhomogeneous densities
4.1 One-dimensional CDF scaling
Let and be a density estimator and a cumulative density estimator, respectively, using a kernel of bandwidth .
Let be a transformed point of using as follows:
(6) 
Then we have the following properties ^{36}:
(7) 
and
(8) 
ReScale ^{36} is a representative implementation of this transform. Figures 1(b) and 2(b) show how ReScale rescales the data distributions of two 1-dimensional datasets; they show that clusters and anomalies are easier to identify after the application of ReScale.
Since this transform can be performed on a one-dimensional dataset only, ReScale must apply the transformation to each attribute independently for a multi-dimensional dataset.
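The per-attribute idea can be sketched with the plain empirical CDF standing in for the bandwidth-based cumulative estimate of Equation 6 (a simplification; ReScale proper uses a kernel estimate with a bandwidth):

```python
import bisect

def empirical_cdf_transform(column):
    """Map each value to its empirical CDF value in (0, 1]; the
    transformed attribute is approximately uniformly distributed."""
    sorted_vals = sorted(column)
    n = len(column)
    return [bisect.bisect_right(sorted_vals, v) / n for v in column]

def rescale(data):
    """Apply the 1-D CDF transform to each attribute independently,
    as ReScale must do for a multi-dimensional dataset."""
    cols = zip(*data)
    tcols = [empirical_cdf_transform(list(c)) for c in cols]
    return [tuple(row) for row in zip(*tcols)]
```

Dense value ranges are stretched and empty gaps compressed, which is the density-equalising effect, but only along one axis at a time; clusters that overlap on every single-attribute projection remain entangled.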
4.2 Multidimensional CDF scaling
Using a distance scaling method, a multi-dimensional transform can be achieved by simply rescaling the distances between each point and all the other instances in its local neighbourhood (thereby considering all dimensions of the multi-dimensional dataset at once).
Given a point , the distance between and all points in its neighbourhood^{3} (^{3}For reasons that will become obvious shortly, we now reparameterise the neighbourhood function to depend on the distance measure.) can be rescaled using a scaling function :
(9) 
where is the scaled distance of .
Here, the scaling function depends on both the position and size of the neighbourhood . It is defined as follows using the estimated density with the aim of making the density distribution within the neighbourhood uniform:
(10) 
where is the maximum pairwise distance in . Note that we have now reparameterised the density estimator to include the distance function s, in order to facilitate calculations below.
With reference to the uniform distribution which has density , where is the volume of the ball having radius :
(11)  
(12) 
That is, the process rescales sparse regions to be more dense by using , and dense regions to be more sparse by using , such that the entire dataset is approximately uniformly distributed in the scaled neighbourhood. More specifically, after rescaling distances with s’, the density of points in the neighbourhood of size around is the same as the density of points across the whole dataset:
(13) 
Note that the above derivation is possible because (i) the scaling is isotropic about the point (hence the shape of the unit ball doesn’t change, only its size) and (ii) the uniform-kernel density estimator is local (i.e., its value depends only on points within the neighbourhood).

In order to maintain the same relative ordering of distances between a point and all other points in the dataset, the distance between the point and any point not in its neighbourhood can be normalised by a simple normalisation:
(14) 
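The two cases (Eqs 9 and 14) can be sketched as follows. The specific scaling factor below, the m-th root of the ratio of local to global average density, is our reading of Eq 10's aim (matching the density within the rescaled neighbourhood to the global average), and the linear mapping for points outside the neighbourhood is an assumption; consult the DScale paper ^{37} for the exact definitions.

```python
import math

def rescaled_distance(x, y, eps, density, avg_density, d_max):
    """Sketch of multi-dimensional CDF distance scaling from x. Within the
    eps-neighbourhood, distances shrink or grow by a factor eta chosen so
    the rescaled local density matches the global average (sparse region:
    eta < 1, dense region: eta > 1); beyond eps, distances are mapped
    linearly onto (eta * eps, d_max] to preserve their ordering from x."""
    m = len(x)
    eta = (density(x) / avg_density) ** (1.0 / m)
    d = math.dist(x, y)
    if d <= eps:
        return eta * d
    return eta * eps + (d - eps) * (d_max - eta * eps) / (d_max - eps)
```

Note the asymmetry discussed after Lemma 2: eta depends on x only, so the rescaled distance from x to y generally differs from that from y to x.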
It is interesting to note that, providing the bandwidth is sufficiently large that the average of the density estimates across the data points approximates the average density over the space, the average rescaling factor is approximately 1, and thus the average distance between points is unchanged after rescaling:^{4} (^{4}Proof: for points in the neighbourhood, the expectation factorises since the scaling factor and the distance are independent; the same can be shown for points outside the neighbourhood.)
(15) 
The implementation of DScale ^{37}, a representative algorithm based on this multi-dimensional CDF scaling, is provided in Algorithm 2 in the Appendix.
A lemma about the density on the rescaled distance is given as follows:
Lemma 2.
The density with the rescaled distance is approximately proportional to the densityratio in terms of the original distance within the neighbourhood of .
Proof.
where and . ∎
Note that the above lemma is only valid within the neighbourhood of each point, and when each point is treated independently. In other words, the transform is only valid locally. In addition, the rescaled distance is asymmetric: the rescaled distance from one point to another generally differs from the reverse.
To be a valid transform globally for the entire dataset, we propose in this paper to perform an iterative process that involves two steps: distance rescaling and point shifting.
5 CDF Transform-Shift (CDFTS)
Instead of just rescaling distances between each data instance and all other points in the dataset, we can apply the rescaling directly to the dataset by translating points in approximate accordance with the rescaled distances. Two advantages of the new approach are: (i) the shifted dataset becomes approximately uniformly distributed as a whole; and (ii) the standard Euclidean distance can then be used to measure distances between the transformed points, rather than the non-symmetric and, as a result, non-metric rescaled distance.
Consider two points. In order to make the distribution around the first point more uniform, we wish to rescale (expand or contract) the distance between the two points as defined by Equations 9 and 14. We can do this by translating the second point to a new point which lies along the direction between them, as follows:
(16) 
Note that the distance between and the newly transformed point satisfies the rescaled distance requirement:
(17) 
Above we considered the effect of translating one point in order to scale the space around another. In order to approximately scale the space around all points in the dataset, each point needs to be translated by the average of the above translations:
(18) 
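A hedged sketch of one transform-shift step (Eqs 16 and 18) follows. We average each point's displacement over all other points; whether the original method averages over all points or only over neighbourhoods is determined by Algorithm 1, so treat this as illustrative.

```python
import math

def shift_step(data, rescaled):
    """One transform-shift step. For each anchor x, point y would move to
    y'_x = x + (d'(x, y) / d(x, y)) * (y - x), which realises the rescaled
    distance d'(x, y) exactly (Eq. 17); y is then translated by the
    average of these per-anchor displacements (Eq. 18)."""
    n, dim = len(data), len(data[0])
    shifted = []
    for j, y in enumerate(data):
        total = [0.0] * dim
        for i, x in enumerate(data):
            if i == j:
                continue
            ratio = rescaled(x, y) / math.dist(x, y)
            for a in range(dim):
                total[a] += (x[a] + ratio * (y[a] - x[a])) - y[a]
        shifted.append(tuple(y[a] + total[a] / (n - 1) for a in range(dim)))
    return shifted
```

With the identity rescaling the data is left untouched; with a contracting rescaling all pairwise distances shrink, pulling the points towards a more uniform configuration.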
A theorem regarding the final shifted data is provided as follows:
Theorem 1.
Given a sufficiently large bandwidth , the neighbourhood density of is expected to be more uniform than that of :
Proof.
Let us define the bandwidth for the transformed neighbourhood around as where and . We can now estimate as follows. We have
Since is in the neighbourhood of we have , and
Supposing that varies slowly within the neighbourhood , we have:
Then based on Equation 15, given a sufficiently large such that and providing the count is sufficiently small, we have:
This means that is mainly affected by the point located in the neighbourhood of , i.e., . Then we have
The above relation can be rewritten as:
As , the relation between and depends on whether is in the sparse or dense region:
(19)  
(20) 
When the density varies slowly in the neighbourhood, we can safely assume that each point’s nearest neighbours remain similar after the transform and that the density still varies slowly in the neighbourhood. Then we have and can estimate as
(21)  
Here we have three scenarios, depending on the density of :
Therefore, , . ∎
Since our target is to obtain uniformly distributed data, this transform-shift process can be repeated using an iterative method to reduce the neighbourhood density variation of the dataset shifted in the previous iteration.
This is accomplished by using an expectation-maximisation-like algorithm whose objective is to minimise the density variance.

A condition for terminating the iterative process is that the total distance of point-shifts in the latest iteration is less than a threshold , i.e.,
(22) 
Note that, due to the shifts in each iteration, the value ranges of the attributes of the shifted dataset are likely to differ from those of the original dataset. We can use a min-max normalisation ^{3} on each attribute to keep the values in the same fixed range at the end of each iteration.
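The outer loop, with the termination test of Equation 22 and the per-iteration min-max normalisation, can be sketched as below. The `shift_step` argument stands in for the DScale-plus-shift computation of Algorithm 1 and `delta` for the shift threshold; both names are ours.

```python
import math

def min_max_normalise(data):
    """Rescale every attribute to [0, 1] at the end of each iteration."""
    cols = list(zip(*data))
    lo, hi = [min(c) for c in cols], [max(c) for c in cols]
    return [tuple((v - l) / (h - l) if h > l else 0.0
                  for v, l, h in zip(row, lo, hi)) for row in data]

def cdfts(data, shift_step, delta, max_iter=50):
    """Repeat transform-shift until the total point movement in the
    latest iteration falls below the threshold delta (Eq. 22)."""
    for _ in range(max_iter):
        shifted = min_max_normalise(shift_step(data))
        total_shift = sum(math.dist(p, q) for p, q in zip(data, shifted))
        data = shifted
        if total_shift < delta:
            break
    return data
```

The `max_iter` cap is a safety net of our own; the paper's termination criterion is the shift threshold alone.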
Figure 4 illustrates the effects of CDFTS on a two-dimensional dataset over successive iterations. It shows that the original clusters with different densities become increasingly uniform as the number of iterations increases.
The implementation of CDFTS is shown in Algorithm 1. The DScale algorithm used in step 6 is provided in Algorithm 2 in the Appendix. Algorithm 1 requires two parameters.
5.1 Density estimators applicable to CDFTS
The following two types of density estimators, with fixed-size and variable-size bandwidths of the uniform kernel, are applicable to CDFTS for the CDF scaling:
(a) neighbourhood density estimator:
where is the volume of the ball with radius .
When this density estimator is used in CDFTS, the neighbourhood described in Sections 4.2 and 5 corresponds to the fixed-size bandwidth uniform kernel.
(b) kth nearest neighbour density estimator:
where is the kth nearest neighbour of ; and is the kth nearest neighbour distance of .
When this density estimator is used in CDFTS, the neighbourhood size becomes dependent on the location and the distribution of the surrounding points: the bandwidth is the distance to the kth nearest neighbour, and the density is calculated over the corresponding neighbourhood. This denotes the variable-size bandwidth uniform kernel, i.e., the bandwidth is small in dense regions and large in sparse regions. Note that, in this circumstance, the density of the shifted points still becomes more uniform w.r.t. their surroundings.
In the following experiments, CDFTS employs the same density estimators as used by DBSCAN and DP (in clustering) and by the kNN anomaly detector to perform the transform, i.e., the neighbourhood density estimator is used in CDFTS for clustering, and the kth nearest neighbour density estimator is used in CDFTS for anomaly detection.
6 Empirical Evaluation
Dataset  Data Size  #Dimensions  #Clusters 

Segment  2310  19  7 
Mice  1080  83  8 
Biodeg  1055  41  2 
ILPD  579  9  2 
ForestType  523  27  4 
Wilt  500  5  2 
Musk  476  166  2 
Libras  360  90  15 
Dermatology  358  34  6 
Ecoli  336  7  8 
Haberman  306  3  2 
Seeds  210  7  3 
Lung  203  3312  5 
Wine  178  13  3 
3L  560  2  3 
4C  1250  2  4 
This section presents experiments designed to evaluate the effectiveness of CDFTS.
All algorithms used in our experiments were implemented in Matlab R2017a (the source code can be obtained at https://sourceforge.net/p/cdfts/). The experiments were run on a machine with an eight-core CPU (Intel Core i7-7820X @ 3.60GHz), 32GB memory and a GPU with 2560 CUDA cores (GeForce GTX 1080). Before the experiments began, all datasets were normalised so that each attribute lies in [0, 1].
For clustering, we used 2 artificial datasets and 14 real-world datasets with different data sizes and dimensions from the UCI Machine Learning Repository ^{11}. Table 2 presents the properties of the datasets. 3L is a 2-dimensional dataset containing three elongated clusters with different densities, as shown in Figure 4(a). 4C is a 2-dimensional dataset containing four clusters with different densities (three Gaussian clusters and one elongated cluster), as shown in Figure 4(b). Note that DBSCAN is unable to correctly identify all clusters in either of these datasets because they do not satisfy the condition specified in Equation 2. Furthermore, the clusters in 3L significantly overlap on individual attribute projections, which violates the requirement of ReScale that the one-dimensional projections allow for the identification of the density peaks of each cluster.

For anomaly detection, we compared the anomaly detection performance on 2 synthetic datasets with anomalous clusters and 10 real-world benchmark datasets.^{5} (^{5}Velocity, Ant and Tomcat are from http://openscience.us/repo/defect/ck/ and the others are from the UCI Machine Learning Repository ^{11}.) The data size, dimensionality and percentage of anomalies are shown in Table 3. Both the Syn 1 and Syn 2 datasets contain clusters of anomalies. Their data distributions are shown in Figure 6.
Dataset  Data Size  #Dimensions  % Anomaly 

AnnThyroid  7200  6  7.42% 
Pageblocks  5473  10  10.23% 
Tomcat  858  20  8.97% 
Ant  745  20  22.28% 
BloodDonation  604  4  5.63% 
Vowel  528  10  9.09% 
Mfeat  410  649  2.44% 
Dermatology  366  20  5.46% 
Balance  302  4  4.64% 
Velocity  229  20  34.06% 
Syn 1  520  2  3.85% 
Syn 2  860  1  6.98% 
6.1 Clustering
In this section, we compare CDFTS with ReScale and DScale using two existing density-based clustering algorithms, DBSCAN ^{14} and DP ^{29}, in terms of F-measure ^{28}: given a clustering result, we calculate the precision and recall scores for each cluster based on the confusion matrix; the F-measure score of a cluster is then the harmonic mean of its precision and recall. The overall F-measure score is the unweighted (macro) average over all clusters.^{6} (^{6}It is worth noting that other evaluation measures such as purity and Normalised Mutual Information (NMI) ^{31} only take into account the points assigned to clusters and do not account for noise. A clustering algorithm which assigns the majority of the points to noise may thus achieve a high score. The F-measure is therefore more suitable than purity or NMI for assessing the performance of density-based clustering.)

We report the best clustering performance after performing a parameter search for each algorithm. Table 4 lists the parameters and their search ranges for each algorithm. The precision parameter in ReScale controls the number of intervals used for estimating the cumulative density. We set a default value for CDFTS.
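For concreteness, the macro-averaged F-measure used above can be sketched as follows, assuming each detected cluster has already been matched to a ground-truth class so that per-cluster true/false positives and false negatives are available:

```python
def f_score(tp, fp, fn):
    """Harmonic mean of a cluster's precision and recall."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def macro_f_measure(per_cluster_counts):
    """Unweighted (macro) average of the per-cluster F-scores, so small
    clusters and points assigned to noise still affect the score."""
    scores = [f_score(tp, fp, fn) for tp, fp, fn in per_cluster_counts]
    return sum(scores) / len(scores)
```

Unlike purity or NMI, a cluster whose points were mostly assigned to noise contributes a low recall, and hence a low F-score, to the average.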
Algorithm  Parameter with search range 

DBSCAN  ; 
DP  ; 
ReScale  ; 
DScale  
CDFTS  ; 
Table 5 shows the best F-measures for DBSCAN, DP, and their ReScale, DScale and CDFTS versions. The average F-measures, shown in the second-to-last row, reveal that CDFTS improves the clustering performance of both DBSCAN and DP with a larger performance gap than both ReScale and DScale. In addition, CDFTS is the best performer on many more datasets than the other contenders (shown in the last row of Table 5).
Data  DBSCAN  DP  

Orig  ReS  DS  CDFTS  Orig  ReS  DS  CDFTS  
Segment  0.59  0.62  0.61  0.67  0.78  0.77  0.80  0.84 
Mice  0.99  0.99  0.99  0.993  1.00  1.00  1.00  1.00 
Biodeg  0.45  0.44  0.47  0.52  0.72  0.74  0.73  0.76 
ILPD  0.41  0.42  0.56  0.52  0.60  0.63  0.62  0.64 
ForestType  0.27  0.51  0.48  0.65  0.69  0.83  0.70  0.85 
Wilt  0.38  0.39  0.54  0.55  0.54  0.68  0.54  0.74 
Musk  0.51  0.52  0.51  0.53  0.55  0.58  0.61  0.62 
Libras  0.41  0.44  0.46  0.49  0.376  0.375  0.376  0.38 
Dermatology  0.52  0.73  0.74  0.83  0.91  0.97  0.91  0.96 
Ecoli  0.37  0.40  0.54  0.60  0.48  0.55  0.63  0.64 
Haberman  0.47  0.64  0.59  0.66  0.56  0.63  0.58  0.67 
Seeds  0.75  0.88  0.85  0.83  0.91  0.92  0.92  0.94 
Lung  0.49  0.54  0.49  0.66  0.70  0.74  0.64  0.72 
Wine  0.64  0.86  0.80  0.90  0.93  0.95  0.96  0.962 
3L  0.59  0.63  0.90  0.88  0.82  0.81  0.86  0.89 
4C  0.71  0.90  0.92  0.95  0.87  0.92  0.95  0.91 
Average  0.53  0.62  0.65  0.70  0.72  0.75  0.74  0.78 
#Top 1  0  1  2  13  1  3  2  13 
It is interesting to see that CDFTS exhibits a large performance improvement over both DBSCAN and DP on many datasets, such as ForestType, Wilt, Dermatology, Ecoli, Lung and Wine. On many of these datasets, CDFTS also shows a large performance gap in comparison with ReScale and/or DScale.
Note that the performance gap is smaller for DP than for DBSCAN because DP is a more powerful algorithm which does not rely on a single density threshold to identify clusters.
To evaluate whether the performance differences among the three scaling algorithms are significant, we conduct the Friedman test with the post-hoc Nemenyi test ^{10}.
Figures 6(a) and 6(b) show the results of the significance test for DBSCAN and DP, respectively. The results show that the CDFTS versions are significantly better than the DScale and ReScale versions for both DBSCAN and DP.
Here we compare the different effects of ReScale and CDFTS on the transformed datasets using MDS visualisation.^{7} (^{7}Multidimensional scaling (MDS) ^{6} is used for visualising a high-dimensional dataset in a 2-dimensional space through a projection method which preserves as well as possible the original pairwise dissimilarities between instances.)
The effects on the two synthetic datasets are shown in Figures 8 and 10. Both rescaled datasets are distributed in a more axis-parallel manner under ReScale than under CDFTS. For the 3L dataset, the blue cluster is still very sparse after running ReScale, as shown in Figure 7(a). In contrast, it becomes denser using CDFTS, as shown in Figure 8(a). As a result, DBSCAN performs much better with CDFTS on the 3L dataset.
Figure 10 shows the MDS plots for the Wilt dataset, on which CDFTS has the largest performance improvement for both DBSCAN and DP. The figure shows that the two clusters are more uniformly distributed in the new space, which makes their boundary easier to identify than in the original space. In contrast, under the original distance, most points of the red cluster are surrounded by points of the blue cluster, as shown in Figure 9(a).
6.2 Anomaly detection
In this section, we evaluate the ability of CDFTS to detect local anomalies based on kNN anomaly detection.
Two state-of-the-art local anomaly detectors, Local Outlier Factor (LOF) ^{7} and iForest ^{19, 22}, are also used in the comparison. Table 6 lists the parameters and their search ranges for each algorithm. The two parameters of iForest control the subsample size and the number of iTrees, respectively. We report the best performance of each algorithm on each dataset in terms of AUC (Area under the ROC Curve) ^{15}.

Algorithm  Parameters and their search range 

NN/LOF  
iForest  ; 
ReScale  ; 
DScale  
CDFTS  ; 
Table 7 compares the best AUC score of each algorithm. CDFTS-kNN achieves the highest average AUC of 0.90 and performs best on 6 out of 12 datasets. For the Syn 1 dataset, whose clusters overlap on the individual attribute projections, DScale and CDFTS obtain much better results than ReScale.
Data  kNN  LOF  iForest  ReS  DS  CDFTS 

AnnThyroid  0.65  0.68  0.88  0.76  0.71  0.94 
Pageblocks  0.89  0.93  0.90  0.88  0.86  0.94 
Tomcat  0.63  0.67  0.81  0.77  0.69  0.78 
Ant  0.67  0.68  0.77  0.76  0.71  0.76 
BloodDonation  0.69  0.79  0.82  0.84  0.80  0.86 
Vowel  0.93  0.93  0.92  0.932  0.90  0.94 
Mfeat  0.99  0.98  0.98  0.99  1.00  0.99 
Dermatology  0.91  0.99  0.86  0.99  0.96  1.00 
Balance  0.94  0.95  0.91  0.91  0.83  0.92 
Velocity  0.66  0.62  0.65  0.67  0.694  0.69 
Syn 1  0.92  0.93  0.90  0.89  0.95  0.98 
Syn 2  0.88  0.88  0.86  0.92  0.95  0.94 
Average  0.81  0.83  0.86  0.86  0.84  0.90 
#Top 1  0  1  2  1  3  6 
Figure 11 shows the significance test for the three scaling algorithms applied to the kNN anomaly detector. The results show that CDFTS-kNN is significantly better than DScale-kNN and ReScale-kNN.
It is interesting that CDFTS-kNN has the largest AUC improvement over kNN on Dermatology and AnnThyroid, as shown in Table 7. Table 8 compares the MDS plots of the original datasets and the shifted (density-equalised) datasets. It shows that on the shifted datasets most anomalies form anomalous clusters farther away from normal points, making them easier for the kNN anomaly detector to detect. Note that in the original datasets some anomalies are close to normal clusters and have higher densities than many normal instances.
Table 8: MDS plots of the Dermatology and AnnThyroid datasets under the original distance and under CDFTS (plots not reproduced here).
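MDS plots like those referenced above can be produced from any dissimilarity matrix. A minimal scikit-learn sketch, using a random dataset and plain Euclidean distances as stand-ins, is:

```python
import numpy as np
from sklearn.manifold import MDS
from sklearn.metrics import pairwise_distances

rng = np.random.default_rng(1)
X = rng.normal(size=(60, 5))                     # placeholder data
D = pairwise_distances(X)                        # any dissimilarity matrix works

# Embed the dissimilarity matrix into 2-D for plotting.
emb = MDS(n_components=2, dissimilarity="precomputed",
          random_state=1).fit_transform(D)
print(emb.shape)
```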
6.3 Runtime
ReScale, DScale and CDFTS are all preprocessing methods; their computational complexities are shown in Table 9. Because many existing density-based clustering algorithms have time and space complexities of O(n²), none of these methods increases their overall complexities.
Algorithm  Time complexity  Space complexity 

ReScale  
DScale and CDFTS 
Table 10 shows the runtime of the dissimilarity matrix calculation for each of the three methods on the Mfeat, Segment, Pageblocks and AnnThyroid datasets. Their parameters are set to the same values for all datasets. The result shows that CDFTS took the longest runtime because it requires multiple DScale calculations.
Dataset  CPU  GPU  

ReScale  DScale  CDFTS  ReScale  DScale  CDFTS  
Mfeat  0.69  0.03  1.26  0.05  0.02  2.89 
Segment  0.29  0.28  2.68  0.02  0.05  0.11 
Pageblocks  0.16  1.64  18.64  0.02  0.17  2.84 
AnnThyroid  0.19  3.09  24.40  0.02  0.26  2.97 
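The runtimes in Table 10 are dominated by the O(n²) dissimilarity-matrix computation. A small timing sketch (with made-up data sizes) illustrates the quadratic growth:

```python
import time
import numpy as np
from scipy.spatial.distance import pdist, squareform

rng = np.random.default_rng(0)
for n in (500, 1000, 2000):                      # doubling n ~quadruples the time
    X = rng.normal(size=(n, 10))
    t0 = time.perf_counter()
    D = squareform(pdist(X))                     # full O(n^2) dissimilarity matrix
    print(n, D.shape, f"{time.perf_counter() - t0:.3f}s")
```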
7 Discussion
7.1 Parameter sensitivity
Both DScale and CDFTS have one critical parameter that defines the neighbourhood. The density-ratio based on a small neighbourhood approximates 1 and provides no information, while that based on a large neighbourhood approximates the estimated density and offers no advantage. Generally, the neighbourhood parameters in DBSCAN and DP should be set slightly smaller than this critical parameter.
A second parameter in CDFTS controls the number of iterations: a smaller value usually results in a higher number of iterations of the CDFTS process and makes the shifted dataset more uniform. In our experiments, we used its default value and obtained significantly better results than existing algorithms. When this parameter is tuned for each dataset, CDFTS can achieve even better results on most datasets.
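The uniform-density target that the iterations converge towards can be illustrated in one dimension: the probability-integral (empirical CDF) transform maps any sample to an approximately uniform one. This toy sketch is our own illustration of the principle, not the CDFTS algorithm itself.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Two 1-D clusters with very different densities.
x = np.r_[rng.normal(0, 0.1, 500), rng.normal(5, 2.0, 500)]

# Empirical CDF transform: replace each value by its rank / n.
u = stats.rankdata(x) / len(x)

# The transformed sample is (approximately) uniform on (0, 1].
ks = stats.kstest(u, "uniform")
print(round(ks.statistic, 4))
```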
7.2 Relation to LOF
LOF ^{7} is a local density-based approach for anomaly detection. The LOF score of a point is the ratio of the average density of its k nearest neighbours to its own density, where the density of a point is inversely proportional to the average distance to its k nearest neighbours ^{7}.
Though LOF is a density-ratio approach, it does not employ the CDF transform, which is the core technique of CDFTS. While LOF can detect local anomalies, it is weaker at identifying anomalous clusters than the kNN detector coupled with CDFTS.
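The contrast can be reproduced with scikit-learn's LOF implementation: a point beside a dense cluster gets the top LOF score even though its global kNN distance is unremarkable. The data here are synthetic stand-ins.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor, NearestNeighbors

rng = np.random.default_rng(0)
# A dense cluster, a sparse cluster, and one local anomaly beside the dense one.
dense = rng.normal(0.0, 0.1, (100, 2))
sparse = rng.normal(10.0, 3.0, (100, 2))
local_anomaly = np.array([[1.0, 1.0]])
X = np.vstack([dense, sparse, local_anomaly])

lof = LocalOutlierFactor(n_neighbors=10).fit(X)
lof_score = -lof.negative_outlier_factor_        # larger = more anomalous

# Global kNN score: distance to the 10th nearest neighbour.
knn_dist = NearestNeighbors(n_neighbors=11).fit(X).kneighbors(X)[0][:, -1]

# LOF ranks the local anomaly (last point) at the top, while its global kNN
# distance is smaller than that of many points in the sparse cluster.
print(np.argmax(lof_score), np.argmax(knn_dist))
```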
7.3 Relation to metric learning
There are many unsupervised distance metric learning algorithms which transform data, e.g., global methods such as PCA ^{18}, KPCA ^{30} and KUMMP ^{35}; and local methods such as Isomap ^{33}, t-SNE ^{24} and LPP ^{17}. These methods are usually used for dimension reduction such that data clusters in the learned low-dimensional space are adjusted based on some objective function ^{34}.
CDFTS is an unsupervised algorithm based on the CDF transform such that the data become more uniform in the scaled space. Though it is a linear and local method, CDFTS is not a metric learning method for dimension reduction.
We have examined the performance of t-SNE in density-based clustering and kNN anomaly detection, setting the dimensionality of its output space to be the same as that of the original data. We found that t-SNE has inconsistent clustering performance: it produced the best clustering results on some high-dimensional datasets such as ForestType and Lung, but yielded the worst results on Breast and 3L. In addition, t-SNE could not improve the performance of the kNN anomaly detector, because anomalies are still dense or close to some dense clusters in the t-SNE-transformed space.
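Note that scikit-learn's t-SNE only supports an output dimensionality above 3 with its exact (non-Barnes-Hut) method; a minimal sketch of such a run on placeholder data is:

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
X = rng.normal(size=(80, 4))                     # placeholder data

# n_components > 3 requires the exact (non-Barnes-Hut) method.
Z = TSNE(n_components=4, method="exact", perplexity=10,
         init="random", random_state=0).fit_transform(X)
print(Z.shape)
```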
7.4 Comparison with a density equalisation method
In the context of kernel k-means ^{25}, the kNN kernel ^{26} has been suggested as a way to equalise the density of a given dataset in order to reduce the bias of the kernel k-means algorithm and improve clustering performance on datasets with inhomogeneous densities. The dissimilarity matrix generated by the kNN kernel is a binary matrix that can be seen as the adjacency matrix of the kNN graph, where "1" means a point is in the set of k nearest neighbours of another point, and "0" otherwise. Thus, a k-means clustering algorithm can be used to group points based on their kNN graph.
However, the kNN kernel can be applied to neither density-based clustering nor kNN anomaly detection. This is because it converts all points to the same density in the new space, regardless of the bandwidth used by the density estimator of the intended algorithm. Thus, DBSCAN will either group neighbouring clusters into a single cluster or assign all points to noise. This outcome is not conducive to producing a good clustering result.
DP also cannot work with the kNN kernel, for the same reason. DP links a point to another point with a higher density to form clusters in the last step of the clustering process ^{29}; it cannot link points when all points have the same density.
For kNN anomaly detection, replacing the distance measure with the kNN kernel produces a similarity of either 0 or 1 between any two points, which does not work for anomaly detection.
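The density-equalising behaviour of the kNN kernel is easy to verify: in the binary kNN adjacency matrix every point has exactly k neighbours, whatever its original density. A sketch with a synthetic two-density dataset:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
# Two clusters with very different densities.
X = np.vstack([rng.normal(0, 0.1, (50, 2)), rng.normal(5, 2.0, (50, 2))])

k = 5
nn = NearestNeighbors(n_neighbors=k + 1).fit(X)  # +1 because each point is its own NN
A = nn.kneighbors_graph(X).toarray()             # binary kNN adjacency matrix
np.fill_diagonal(A, 0)                           # drop self-links

# Every row sums to k: each point has exactly k neighbours in the new
# representation, regardless of its original density.
print(np.unique(A.sum(axis=1)))
```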
8 Conclusions
We introduce a CDF Transform-Shift (CDFTS) algorithm based on a multi-dimensional CDF transform to effectively deal with datasets with inhomogeneous densities. It is the first attempt to change the volumes of every point's neighbourhood in the entire multi-dimensional space in order to achieve the desired effect of a uniform distribution for the whole dataset w.r.t. a density estimator. Existing CDF transform methods such as ReScale ^{36} and DScale ^{37} achieve a much weaker effect on the rescaled dataset. We show that this more sophisticated CDF transform brings about a much better outcome.
In addition, CDFTS is more generic than existing CDF transform methods in its ability to use different density estimators, i.e., uniform kernels with either fixed or variable bandwidths. As a result, CDFTS can be applied to more algorithms/tasks, with a better outcome than existing methods such as the kNN kernel ^{26}.
Through extensive evaluations, we show that CDFTS significantly improves the performance of two existing density-based clustering algorithms and one existing distance-based anomaly detector.
Appendix: DScale algorithm
Algorithm 2 is a generalisation of the implementation of DScale ^{37}, which can employ a density estimator with a bandwidth parameter. It requires this one parameter only.