The "Obesity Classifier Using CURE Algorithm" project aims to analyze a dataset containing BMI (Body Mass Index) and age information to classify individuals into different weight categories (e.g., underweight, normal weight, overweight, obese) using the CURE (Clustering Using Representatives) algorithm. Additionally, this project demonstrates how CURE outperforms K-means clustering for this problem statement.
The CURE algorithm is particularly applicable in this project due to its ability to handle clusters of arbitrary shape and size. Since BMI and age data may not necessarily form spherical clusters, CURE's flexibility in representing clusters using a small set of representative points makes it suitable for this analysis. Additionally, CURE is robust to outliers, which may be present in real-world BMI and age datasets.
- getDistanceFromRepresentatives
Calculates the minimum distance from a given point to any of the representative points of a cluster.
def getDistanceFromRepresentatives(point, representativePoints_shifted):
    minimumDistance = float("inf")
    for repr_point in representativePoints_shifted:
        distance = getDistance(point, repr_point)
        if distance < minimumDistance:
            minimumDistance = distance
    return minimumDistance
Purpose: This function helps in assigning a point to the nearest cluster based on the shifted representative points.
- computeCentroid
Computes the centroid (geometric center) of a cluster of points.
def computeCentroid(initialCluster):
    x = 0
    y = 0
    for point in initialCluster:
        x += point[0]
        y += point[1]
    numberOfPoints = len(initialCluster)
    return (x / numberOfPoints, y / numberOfPoints)
Purpose: To find the center of a cluster, which is used to shift representative points towards the centroid.
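As a quick check of the centroid calculation, here is a minimal sketch on three made-up points. The name `compute_centroid` and the sample values are illustrative assumptions, not part of the project's script.

```python
def compute_centroid(cluster):
    """Return the arithmetic mean of a list of 2-D points."""
    if not cluster:
        raise ValueError("cluster must contain at least one point")
    xs = [p[0] for p in cluster]
    ys = [p[1] for p in cluster]
    return (sum(xs) / len(cluster), sum(ys) / len(cluster))


# Hypothetical (x, y) points, chosen only for illustration.
toy_cluster = [(20.0, 30.0), (22.0, 40.0), (24.0, 50.0)]
centroid = compute_centroid(toy_cluster)
print(centroid)  # (22.0, 40.0)
```

Unlike the function above, this variant raises an error on an empty cluster instead of dividing by zero.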
- findRepresentativePoints
Selects up to n well-scattered representative points from a cluster.
def findRepresentativePoints(initialCluster):
    # Start from the first point, then greedily add the point that is
    # farthest from the representatives chosen so far (n is a global).
    representativePoints = [list(initialCluster[0])]
    for i in range(n - 1):
        maximumDistance = float("-inf")
        candidateRepresentativePoint = None
        for point in initialCluster:
            # Points are tuples but representatives are stored as lists,
            # so convert before testing membership.
            if list(point) in representativePoints:
                continue
            minimumDistance = float("inf")
            for representativePoint in representativePoints:
                distance = getDistance(point, representativePoint)
                if distance < minimumDistance:
                    minimumDistance = distance
            if minimumDistance > maximumDistance:
                candidateRepresentativePoint = point
                maximumDistance = minimumDistance
        if candidateRepresentativePoint is None:
            break  # the cluster has fewer than n distinct points
        representativePoints.append(list(candidateRepresentativePoint))
    return representativePoints
Purpose: To select a set of points that are representative of the cluster’s shape and spread.
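The greedy farthest-point selection can also be sketched more compactly. The version below is an illustrative variant, not the project's code: it takes `n` as an argument instead of a global, uses `math.dist` (Python 3.8+), and the function name and toy points are assumptions.

```python
import math


def find_representatives(cluster, n):
    """Greedy farthest-point selection of up to n representatives."""
    representatives = [cluster[0]]
    while len(representatives) < min(n, len(cluster)):
        # Pick the point whose nearest representative is farthest away.
        candidate = max(
            (p for p in cluster if p not in representatives),
            key=lambda p: min(math.dist(p, r) for r in representatives),
            default=None,
        )
        if candidate is None:
            break  # only duplicates of already chosen points remain
        representatives.append(candidate)
    return representatives


# Hypothetical cluster, chosen so the selection is easy to follow by hand.
toy_cluster = [(0.0, 0.0), (1.0, 0.0), (5.0, 0.0), (5.0, 5.0), (0.0, 4.0)]
reps = find_representatives(toy_cluster, 3)
print(reps)  # [(0.0, 0.0), (5.0, 5.0), (5.0, 0.0)]
```

Starting from `(0.0, 0.0)`, the farthest point is `(5.0, 5.0)`; the next pick is `(5.0, 0.0)`, which sits 5 units from both chosen points.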
- getDistance
Calculates the Euclidean distance between two points.
import math

def getDistance(point1, point2):
    distance = (point1[0] - point2[0]) ** 2 + (point1[1] - point2[1]) ** 2
    return math.sqrt(distance)
Purpose: To measure the distance between two points, which is used in various parts of the clustering process.
- clusterDistance
Calculates the minimum distance between any two points from two different clusters.
def clusterDistance(cluster1, cluster2):
    minimumDistance = float("inf")
    for point1 in cluster1:
        for point2 in cluster2:
            # Compute each pairwise distance once and keep the smallest.
            distance = getDistance(point1, point2)
            if minimumDistance > distance:
                minimumDistance = distance
    return minimumDistance
Purpose: To determine the closeness of two clusters during hierarchical clustering.
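Taking the minimum over all cross-cluster pairs is the single-linkage criterion. A minimal sketch, with a hypothetical function name and made-up clusters:

```python
import math


def single_link_distance(cluster_a, cluster_b):
    """Smallest Euclidean distance between a point of each cluster."""
    return min(math.dist(p, q) for p in cluster_a for q in cluster_b)


# Hypothetical clusters for illustration.
left = [(0.0, 0.0), (1.0, 0.0)]
right = [(4.0, 4.0), (1.0, 3.0)]
d = single_link_distance(left, right)
print(d)  # 3.0, the gap between (1.0, 0.0) and (1.0, 3.0)
```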
- formClusters_heirarchical
Performs hierarchical clustering on the sample data until the desired number of clusters k is achieved.
def formClusters_heirarchical(sampleData):
    # Start with one cluster per sample point and repeatedly merge the
    # two closest clusters until only k remain (k is a global).
    clusters = [[i] for i in sampleData]
    iters = len(sampleData) - k
    for _ in range(iters):
        minimumDistance = float("inf")
        for i in range(0, len(clusters) - 1):
            for j in range(i + 1, len(clusters)):
                distance = clusterDistance(clusters[i], clusters[j])
                if minimumDistance > distance:
                    minimumDistance = distance
                    c1 = i
                    c2 = j
        clusters[c1].extend(clusters[c2])
        del clusters[c2]
    return clusters
Purpose: To reduce the initial dataset into k clusters using a hierarchical approach.
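To see the merging behaviour on a small input, the following sketch runs the same idea on six made-up points. It is an illustrative variant that passes `k` explicitly; the names `single_link` and `agglomerate` are assumptions.

```python
import math


def single_link(a, b):
    return min(math.dist(p, q) for p in a for q in b)


def agglomerate(points, k):
    """Merge the two closest clusters until only k clusters remain."""
    if k < 1:
        raise ValueError("k must be at least 1")
    clusters = [[p] for p in points]
    while len(clusters) > k:
        best = None  # (distance, i, j) of the closest pair found so far
        for i in range(len(clusters) - 1):
            for j in range(i + 1, len(clusters)):
                d = single_link(clusters[i], clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i].extend(clusters[j])
        del clusters[j]
    return clusters


# Two well-separated hypothetical groups of three points each.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
clusters = agglomerate(points, 2)
for cluster in clusters:
    print(cluster)
```

Each pass scans every pair of clusters, so this approach is only practical on a sample, which is why CURE clusters a sample first rather than the complete dataset.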
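Reads the sample and complete data files, along with the clustering parameters k, n, and p, from the command line.
import sys

with open(sys.argv[1]) as f:
    sampleDataFile = f.readlines()
with open(sys.argv[2]) as f:
    completeDataFile = f.readlines()
k = int(sys.argv[3])    # number of clusters
n = int(sys.argv[4])    # representative points per cluster
p = float(sys.argv[5])  # fraction to shift representatives towards the centroid
outputFileName = sys.argv[6]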
Converts the read data into lists of tuples representing points.
completeData = []
for line in completeDataFile:
    line = line.split(",")
    completeData.append((float(line[0]), float(line[1])))
sampleData = []
for line in sampleDataFile:
    line = line.split(",")
    sampleData.append((float(line[0]), float(line[1])))
sampleData = sorted(sampleData, key=lambda x: (x[0], x[1]))
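The parsing above fails on blank lines, since `float("")` raises a `ValueError`. A slightly more defensive sketch using the standard `csv` module is shown below. It assumes, like the code above, that the files have two numeric columns and no header row; `read_points` and the sample text are hypothetical.

```python
import csv
import io


def read_points(lines):
    """Parse 'x,y' rows into float tuples, skipping blank lines."""
    points = []
    for row in csv.reader(lines):
        if not row or not "".join(row).strip():
            continue  # csv.reader yields an empty list for a blank line
        points.append((float(row[0]), float(row[1])))
    return points


# In-memory stand-in for a data file; the values are made up.
sample_text = "22.5,31\n30.1,45\n\n18.2,19\n"
points = sorted(read_points(io.StringIO(sample_text)))
print(points)  # [(18.2, 19.0), (22.5, 31.0), (30.1, 45.0)]
```

Sorting tuples compares the first coordinate and then the second, which matches the `key=lambda x: (x[0], x[1])` ordering used above.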
Forms initial clusters using hierarchical clustering on the sample data.
initialClusters = formClusters_heirarchical(sampleData)
For each cluster, finds representative points and shifts them towards the centroid.
# For plotting initial clusters
initialClusterAssignments = []
for clusterId, initialCluster in enumerate(initialClusters):
    for point in initialCluster:
        initialClusterAssignments.append((point, clusterId))

# Find representative points and shift them
representativePointsList = []
representativePoints_shifted = []
for initialCluster in initialClusters:
    representativePoints = findRepresentativePoints(initialCluster)
    representativePointsList.append(representativePoints)
    shiftedRepresentativePoints = []
    centroid = computeCentroid(initialCluster)
    for representativePoint in representativePoints:
        # Move each representative a fraction p of the way to the centroid.
        shiftX = (centroid[0] - representativePoint[0]) * p
        shiftY = (centroid[1] - representativePoint[1]) * p
        shiftedRepresentativePoints.append((representativePoint[0] + shiftX, representativePoint[1] + shiftY))
    representativePoints_shifted.append(shiftedRepresentativePoints)
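The shift is a linear interpolation: a representative r moves to r + p * (c - r), where c is the centroid. With p = 0 the point stays put, and with p = 1 it collapses onto the centroid. A small sketch with made-up values (the name `shift_towards` is an assumption, and p = 0.5 is used only to keep the arithmetic exact):

```python
def shift_towards(point, centroid, p):
    """Move point a fraction p of the way towards centroid."""
    return (
        point[0] + p * (centroid[0] - point[0]),
        point[1] + p * (centroid[1] - point[1]),
    )


# Hypothetical centroid and representatives.
centroid = (10.0, 20.0)
representatives = [(0.0, 0.0), (20.0, 40.0)]
shifted = [shift_towards(r, centroid, 0.5) for r in representatives]
print(shifted)  # [(5.0, 10.0), (15.0, 30.0)]
```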
Assigns each point in the complete dataset to the nearest cluster based on the representative points.
outputPointList = []
for point in completeData:
    minimumDistance = float("inf")
    for clusterNum in range(k):
        distance = getDistanceFromRepresentatives(point, representativePoints_shifted[clusterNum])
        if distance < minimumDistance:
            minimumDistance = distance
            clusterId = clusterNum
    outputPointList.append((point, clusterId))
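The assignment rule can be checked on a toy example. The sketch below is an illustrative variant with hypothetical names and values; like the loop above, it resolves ties in favour of the lower cluster index.

```python
import math


def assign_cluster(point, representatives_by_cluster):
    """Return the index of the cluster with the nearest representative."""
    distances = [
        min(math.dist(point, r) for r in reps)
        for reps in representatives_by_cluster
    ]
    return distances.index(min(distances))


# Hypothetical shifted representatives for two clusters.
shifted_reps = [
    [(1.0, 1.0), (2.0, 2.0)],  # cluster 0
    [(8.0, 8.0), (9.0, 7.0)],  # cluster 1
]
queries = [(0.0, 0.0), (7.0, 7.0), (4.0, 4.0)]
labels = [assign_cluster(q, shifted_reps) for q in queries]
print(labels)  # [0, 1, 0]
```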
Saves the clustering results to a specified output file.
# Each output row is: x,y,clusterId
with open(outputFileName, 'w') as w:
    for point in outputPointList:
        w.write(str(point[0][0]) + "," + str(point[0][1]) + "," + str(point[1]) + "\n")
Uses matplotlib to create two scatter plots of the points, colored by assigned cluster: one for the initial clusters and one for the final clusters.
import matplotlib.pyplot as plt

# Plot initial clusters
# plt.cm.get_cmap was deprecated and later removed in newer matplotlib
# releases; plt.get_cmap accepts the same arguments.
colors = plt.get_cmap("tab10", k)
plt.figure(figsize=(12, 6))
plt.subplot(1, 2, 1)
for point, clusterId in initialClusterAssignments:
    plt.scatter(point[0], point[1], color=colors(clusterId))
plt.xlabel('X')
plt.ylabel('Y')
plt.title('Initial Clusters')

# Plot final clusters
plt.subplot(1, 2, 2)
for point, clusterId in outputPointList:
    plt.scatter(point[0], point[1], color=colors(clusterId))
plt.xlabel('X')
plt.ylabel('Y')
plt.title('Final CURE Clusters')
plt.show()
- Sampling: Initially, a sample of the data is used to create clusters.
- Hierarchical Clustering: The sample data undergoes hierarchical clustering to form k clusters.
- Representative Points: For each cluster, n representative points are chosen that best capture the geometry of the cluster.
- Point Shifting: These representative points are then shifted towards the cluster's centroid by a fraction p to reduce the effect of outliers.
- Assigning Clusters: Each point in the complete dataset is assigned to the nearest cluster based on the minimum distance to the shifted representative points.
- Visualization: Finally, the clustered points are plotted using different colors to visualize the clustering results.
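The steps listed above, minus plotting, can be tied together in a compact sketch. This is an illustration under stated assumptions rather than the project's script: all function names are hypothetical, the data is a pair of made-up 3x3 grids, and taking every other point stands in for a random sample.

```python
import math


def single_link(a, b):
    return min(math.dist(p, q) for p in a for q in b)


def agglomerate(points, k):
    """Hierarchical clustering: merge the closest pair until k remain."""
    clusters = [[p] for p in points]
    while len(clusters) > k:
        pairs = [(i, j) for i in range(len(clusters))
                 for j in range(i + 1, len(clusters))]
        i, j = min(pairs, key=lambda ij: single_link(clusters[ij[0]], clusters[ij[1]]))
        clusters[i].extend(clusters[j])
        del clusters[j]
    return clusters


def representatives(cluster, n):
    """Pick up to n well-scattered points by farthest-point selection."""
    reps = [cluster[0]]
    while len(reps) < min(n, len(set(cluster))):
        reps.append(max((pt for pt in cluster if pt not in reps),
                        key=lambda pt: min(math.dist(pt, r) for r in reps)))
    return reps


def shrink(reps, cluster, p):
    """Shift each representative a fraction p towards the centroid."""
    cx = sum(x for x, _ in cluster) / len(cluster)
    cy = sum(y for _, y in cluster) / len(cluster)
    return [(x + p * (cx - x), y + p * (cy - y)) for x, y in reps]


def cure(sample, data, k, n, p):
    clusters = agglomerate(sample, k)
    shifted = [shrink(representatives(c, n), c, p) for c in clusters]

    def nearest(pt):
        return min(range(len(shifted)),
                   key=lambda c: min(math.dist(pt, r) for r in shifted[c]))

    return [nearest(pt) for pt in data]


# Two made-up, well-separated 3x3 grids of points.
blob_a = [(20 + dx, 25 + dy) for dx in range(3) for dy in range(3)]
blob_b = [(35 + dx, 55 + dy) for dx in range(3) for dy in range(3)]
data = blob_a + blob_b
sample = data[::2]  # deterministic stand-in for random sampling
labels = cure(sample, data, k=2, n=4, p=0.2)
print(labels)
```

On this toy input every point in the first grid receives label 0 and every point in the second receives label 1, because the two groups are far apart relative to their size.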
- Clone the repo
git clone https://github.com/AnonO6/ObesityClassifierUsingCUREAlgorithm.git
cd ObesityClassifierUsingCUREAlgorithm
- Install pandas, matplotlib, and scikit-learn (scikit-learn is needed for the k-means clustering in compare_clusters.py)
pip install pandas
pip install matplotlib
pip install scikit-learn
- Ensure you have a complete_data.csv file containing your complete data. Run the script to generate sample_data.csv:
python generate_sample_data.py
- Open the cure_algorithm_with_plots.py script and update the parameters as needed (e.g., number of clusters, sample size). Then run the command below, replacing sample_data.csv, complete_data.csv, and output.csv with your actual file names. Adjust the parameters (3, 5, 0.2), which are the number of clusters k, the number of representative points n, and the shift fraction p, according to your requirements.
python clustering_cure.py sample_data.csv complete_data.csv 3 5 0.2 output.csv
- To compare the results of k-means and CURE, run compare_clusters.py as shown below. Replace sample_data.csv, complete_data.csv, and output_cure.csv with your actual file names, and adjust the parameters (3, 5, 0.2) according to your requirements.
python compare_clusters.py sample_data.csv complete_data.csv 3 5 0.2 output_cure.csv
- After running the script, two plots will be displayed: one showing the initial clusters and the other showing the final clusters.
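For readers who want to see what drawing a sample might look like, here is a minimal sketch of random sampling without replacement. It is not the contents of generate_sample_data.py, which this document does not show; the function name, the 20% fraction, and the seed are all assumptions.

```python
import random


def draw_sample(rows, fraction, seed=None):
    """Return a random subset of rows, drawn without replacement."""
    rng = random.Random(seed)
    size = min(len(rows), max(1, round(len(rows) * fraction)))
    return rng.sample(rows, size)


# Made-up "x,y" rows standing in for the lines of a data file.
rows = [f"{20 + i},{30 + i}" for i in range(50)]
sample_rows = draw_sample(rows, fraction=0.2, seed=42)
print(len(sample_rows))  # 10
```

Fixing the seed makes the sample reproducible, which helps when comparing clustering runs with different k, n, or p values.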