20220520
sun1638650145 committed May 20, 2022
1 parent 1c12251 commit 0d55c1a
Showing 1 changed file with 92 additions and 2 deletions.
94 changes: 92 additions & 2 deletions ML.md
@@ -3436,8 +3436,6 @@ $$
JC=\frac{a}{a+b+c}
$$
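
As a small worked example of the Jaccard coefficient above, the sketch below counts sample pairs and evaluates $JC$. It assumes the usual pair-count meaning of the symbols, which is defined outside this excerpt: $a$ counts pairs placed in the same cluster by both the clustering and the reference, $b$ pairs grouped together only by the clustering, and $c$ pairs grouped together only by the reference. The function names `pair_counts` and `jaccard_coefficient` are hypothetical.

```python
from itertools import combinations


def pair_counts(pred, ref):
    """Count sample pairs by agreement between a clustering and a reference.

    Assumed meaning of the counts (not stated in this excerpt):
    a: same cluster in both; b: same in pred only;
    c: same in ref only;     d: different in both.
    """
    a = b = c = d = 0
    for i, j in combinations(range(len(pred)), 2):
        same_pred = pred[i] == pred[j]
        same_ref = ref[i] == ref[j]
        if same_pred and same_ref:
            a += 1
        elif same_pred:
            b += 1
        elif same_ref:
            c += 1
        else:
            d += 1
    return a, b, c, d


def jaccard_coefficient(pred, ref):
    """JC = a / (a + b + c); undefined when a + b + c == 0."""
    a, b, c, _ = pair_counts(pred, ref)
    return a / (a + b + c)


print(jaccard_coefficient([0, 0, 1, 1], [0, 0, 0, 1]))  # 0.25
```

Here $a=1$, $b=1$, $c=2$, so $JC=1/4$. The sketch does not guard against $a+b+c=0$, which happens when every sample is a singleton in both partitions.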

*

* FM index (Fowlkes and Mallows Index, $\text{FMI}$ for short)
$$
FMI=\sqrt{\frac{a}{a+b}\cdot\frac{a}{a+c}}
@@ -3856,3 +3854,95 @@ $$
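* Lines 3-5 are the E-step of the EM algorithm, and lines 6-11 are the M-step.
* The stopping condition can be a maximum number of iterations, or the likelihood function $LL(D)$ increasing very little or not at all. Lines 14-17 determine the cluster assignment from the Gaussian mixture distribution.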
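
The stopping condition and the final assignment step can be sketched as below. The algorithm listing itself is outside this excerpt, so everything here is an assumption made for illustration: the names `should_stop` and `assign_clusters`, the tolerance `tol`, and the idea that each sample is assigned to the mixture component with the largest posterior probability.

```python
def should_stop(ll_history, max_iter=100, tol=1e-6):
    """Return True when EM should stop.

    ll_history holds LL(D) after each completed iteration (hypothetical
    bookkeeping). Stop when the iteration budget is used up, or when the
    latest increase of LL(D) is at most tol.
    """
    if len(ll_history) >= max_iter:
        return True
    if len(ll_history) >= 2 and ll_history[-1] - ll_history[-2] <= tol:
        return True
    return False


def assign_clusters(posteriors):
    """Assign each sample to the component with the largest posterior.

    posteriors[j][i] is assumed to be the posterior probability that
    sample j was generated by mixture component i.
    """
    return [max(range(len(row)), key=row.__getitem__) for row in posteriors]


print(should_stop([-10.0, -5.0]))                    # False
print(assign_clusters([[0.9, 0.1], [0.2, 0.8]]))     # [0, 1]
```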

## 9.5.Density clustering

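* Density clustering, also called density-based clustering, assumes that the clustering structure can be determined by how tightly the samples are distributed.

* Density clustering algorithms examine the connectivity between samples from the perspective of sample density, and keep expanding clusters over connectable samples to obtain the final clustering result.

* DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a well-known density clustering algorithm. It characterizes how tightly samples are distributed using a set of neighborhood parameters $(\epsilon, MinPts)$.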

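* Given a dataset $D=\{\pmb{x}_1,\pmb{x}_2,...,\pmb{x}_m\}$, define the following concepts:

  * $\epsilon$-neighborhood: for $\pmb{x}_j\in D$, its $\epsilon$-neighborhood contains the samples in $D$ whose distance to $\pmb{x}_j$ is at most $\epsilon$, i.e. $N_\epsilon(\pmb{x}_j)=\{\pmb{x}_i\in D|\text{dist}(\pmb{x}_i,\pmb{x}_j)\leqslant\epsilon\}$;
  * Core object: if the $\epsilon$-neighborhood of $\pmb{x}_j$ contains at least $MinPts$ samples, i.e. $\lvert N_\epsilon(\pmb{x}_j)\rvert\geqslant MinPts$, then $\pmb{x}_j$ is a core object;
  * Directly density-reachable: if $\pmb{x}_j$ lies in the $\epsilon$-neighborhood of $\pmb{x}_i$ and $\pmb{x}_i$ is a core object, then $\pmb{x}_j$ is directly density-reachable from $\pmb{x}_i$;
    * The directly density-reachable relation is generally not symmetric.
  * Density-reachable: for $\pmb{x}_i$ and $\pmb{x}_j$, if there exists a sequence of samples $\pmb{p}_1, \pmb{p}_2, ..., \pmb{p}_n$ with $\pmb{p}_1=\pmb{x}_i$, $\pmb{p}_n=\pmb{x}_j$, and $\pmb{p}_{i+1}$ directly density-reachable from $\pmb{p}_i$, then $\pmb{x}_j$ is density-reachable from $\pmb{x}_i$.
    * The density-reachable relation is transitive, but not symmetric.
  * Density-connected: for $\pmb{x}_i$ and $\pmb{x}_j$, if there exists $\pmb{x}_k$ such that both $\pmb{x}_i$ and $\pmb{x}_j$ are density-reachable from $\pmb{x}_k$, then $\pmb{x}_i$ and $\pmb{x}_j$ are density-connected.
    * The density-connected relation is symmetric.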
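
The first three definitions above can be checked on a toy dataset. The following is a minimal sketch using only the Python standard library and Euclidean distance; the helper names (`eps_neighborhood`, `is_core`, `directly_reachable`) and the data are hypothetical, chosen only to illustrate why direct density-reachability is not symmetric.

```python
from math import dist  # Euclidean distance, Python 3.8+


def eps_neighborhood(D, j, eps):
    """Indices i with dist(D[i], D[j]) <= eps; D[j] itself is included."""
    return {i for i in range(len(D)) if dist(D[i], D[j]) <= eps}


def is_core(D, j, eps, min_pts):
    """D[j] is a core object if its eps-neighborhood has >= min_pts samples."""
    return len(eps_neighborhood(D, j, eps)) >= min_pts


def directly_reachable(D, j, i, eps, min_pts):
    """True if D[j] is directly density-reachable from D[i]."""
    return is_core(D, i, eps, min_pts) and j in eps_neighborhood(D, i, eps)


# Toy data: four nearby points and one isolated point.
D = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (2.0, 0.0), (5.0, 5.0)]
eps, min_pts = 1.0, 3

cores = [j for j in range(len(D)) if is_core(D, j, eps, min_pts)]
print(cores)                                      # [0, 2]
print(directly_reachable(D, 1, 0, eps, min_pts))  # True
print(directly_reachable(D, 0, 1, eps, min_pts))  # False
```

Sample 1 is directly density-reachable from the core object 0, but not the other way round, because sample 1 is not itself a core object.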

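* DBSCAN defines a cluster as the maximal set of density-connected samples derived from the density-reachable relation.

  * Samples in $D$ that do not belong to any cluster are regarded as noise or anomalies.
  * Given neighborhood parameters $(\epsilon, MinPts)$, a cluster $C\subseteq D$ is a non-empty subset of samples satisfying the following properties:
    * Connectivity: $\pmb{x}_i\in C$, $\pmb{x}_j\in C\Rightarrow\pmb{x}_i$ and $\pmb{x}_j$ are density-connected
    * Maximality: $\pmb{x}_i\in C$, $\pmb{x}_j$ is density-reachable from $\pmb{x}_i$ $\Rightarrow\pmb{x}_j\in C$

* If $\pmb{x}$ is a core object, let $X=\{\pmb{x}'\in D\mid\pmb{x}'\text{ is density-reachable from }\pmb{x}\}$ denote the set of all samples density-reachable from $\pmb{x}$. It can then be shown that $X$ is a cluster satisfying connectivity and maximality.

* The DBSCAN algorithm picks an arbitrary core object in the dataset as a seed, and then determines the corresponding cluster starting from it.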
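
To illustrate the set $X$ of samples density-reachable from a seed, here is a minimal breadth-first sketch. It assumes Euclidean distance and treats the seed as reachable from itself; the function name `reachable_from` and the toy data are hypothetical.

```python
from collections import deque
from math import dist


def reachable_from(D, seed, eps, min_pts):
    """Indices of all samples density-reachable from the core object D[seed]."""
    def neighborhood(j):
        return {i for i in range(len(D)) if dist(D[i], D[j]) <= eps}

    if len(neighborhood(seed)) < min_pts:
        raise ValueError("seed must be a core object")

    X = {seed}
    queue = deque([seed])
    while queue:
        q = queue.popleft()
        N = neighborhood(q)
        if len(N) >= min_pts:      # only core objects extend the chain
            for i in N - X:
                X.add(i)
                queue.append(i)
    return X


D = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (2.0, 0.0), (5.0, 5.0)]
print(sorted(reachable_from(D, 0, eps=1.0, min_pts=3)))  # [0, 1, 2, 3]
print(sorted(reachable_from(D, 2, eps=1.0, min_pts=3)))  # [0, 1, 2, 3]
```

Both core objects (indices 0 and 2) yield the same set, which is consistent with maximality: starting from any core object of a cluster recovers the whole cluster. The isolated point at index 4 is reachable from neither and would be treated as noise.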

* Description of the DBSCAN algorithm

---

<b>Input:</b> sample set $D=\{\pmb{x}_1, \pmb{x}_2, ...,\pmb{x}_m\}$;

neighborhood parameters $(\epsilon, MinPts)$.

<b>Procedure:</b>

1: Initialize the set of core objects: $\Omega = \varnothing$

2: <b>for</b> $j=1,2,...,m$ <b>do</b>

3: Determine the $\epsilon$-neighborhood $N_\epsilon(\pmb{x}_j)$ of sample $\pmb{x}_j$;

4: <b>if</b> $\lvert N_\epsilon(\pmb{x}_j)\rvert\geqslant MinPts$ <b>then</b>

5: Add sample $\pmb{x}_j$ to the set of core objects: $\Omega=\Omega\cup\{\pmb{x}_j\}$

6: <b>end if</b>

7: <b>end for</b>

8: Initialize the number of clusters: $k=0$

9: Initialize the set of unvisited samples: $\Gamma=D$

10: <b>while</b> $\Omega\neq\varnothing$ <b>do</b>

11: Record the current set of unvisited samples: $\Gamma_\text{old}=\Gamma$;

12: Randomly select a core object $\pmb{o}\in\Omega$ and initialize the queue $Q=\langle\pmb{o}\rangle$;

13: $\Gamma=\Gamma\setminus\{\pmb{o}\}$;

14: <b>while</b> $Q\neq\varnothing$ <b>do</b>

15: Take the first sample $\pmb{q}$ out of queue $Q$;

16: <b>if</b> $\lvert N_\epsilon(\pmb{q})\rvert\geqslant MinPts$ <b>then</b>

17: Let $\Delta=N_\epsilon(\pmb{q})\cap\Gamma$;

18: Add the samples in $\Delta$ to queue $Q$;

19: $\Gamma=\Gamma\setminus\Delta$;

20: <b>end if</b>

21: <b>end while</b>

22: $k=k+1$, generate cluster $C_k=\Gamma_\text{old}\setminus\Gamma$;

23: $\Omega=\Omega\setminus C_k$

24: <b>end while</b>

<b>Output</b>: cluster partition $C=\{C_1, C_2, ..., C_k\}$

---
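
The procedure above can be followed step by step in Python. This is a minimal sketch, not a tuned implementation: it assumes Euclidean distance, precomputes all neighborhoods in $O(m^2)$ time, works with sample indices instead of the samples themselves, and uses a seeded random generator so that runs are repeatable. The function name `dbscan`, the `seed` parameter, and the returned noise set are choices made for this illustration.

```python
import random
from collections import deque
from math import dist


def dbscan(D, eps, min_pts, seed=0):
    """Return (clusters, noise) as sets of indices into D."""
    rng = random.Random(seed)
    m = len(D)

    # Steps 1-7: find the eps-neighborhoods and the core objects.
    neighborhoods = [
        {i for i in range(m) if dist(D[i], D[j]) <= eps} for j in range(m)
    ]
    omega = {j for j in range(m) if len(neighborhoods[j]) >= min_pts}

    clusters = []              # step 8: k is len(clusters)
    gamma = set(range(m))      # step 9: unvisited samples

    while omega:                                  # step 10
        gamma_old = set(gamma)                    # step 11
        o = rng.choice(sorted(omega))             # step 12
        queue = deque([o])
        gamma.discard(o)                          # step 13
        while queue:                              # step 14
            q = queue.popleft()                   # step 15
            if len(neighborhoods[q]) >= min_pts:  # step 16
                delta = neighborhoods[q] & gamma  # step 17
                queue.extend(delta)               # step 18
                gamma -= delta                    # step 19
        c_k = gamma_old - gamma                   # step 22
        clusters.append(c_k)
        omega -= c_k                              # step 23

    # Samples never visited belong to no cluster and are treated as noise.
    return clusters, gamma


# Two dense squares and one isolated point (toy data).
D = [(0, 0), (0, 1), (1, 0), (1, 1),
     (10, 10), (10, 11), (11, 10), (11, 11),
     (5, 5)]
clusters, noise = dbscan(D, eps=1.5, min_pts=3)
print(sorted(sorted(c) for c in clusters))  # [[0, 1, 2, 3], [4, 5, 6, 7]]
print(sorted(noise))                        # [8]
```

The order in which clusters are discovered depends on the random choice in step 12, so the example sorts the result before printing. A border sample that lies within $\epsilon$ of core objects from two different clusters is assigned to whichever cluster reaches it first.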
