Breast Cancer Prediction with Python, R and K-nearest Neighbors (KNN)

A Beginner’s guides learning popular Data Science programming languages

Back in 1995 before Data Science was popular UCI University of Wisconsin published a dataset of labeled 569 patients with Malignant and Benign tumor findings according to thirty different measures. A sample copy of the csv can be found at https://fashion.s3.us-east-2.amazonaws.com/breast_cancer.csv

The thirty features given from the above CSV are: mean perimeter, mean area, mean smoothness, mean compactness, mean concavity, mean concave points, mean symmetry, mean fractal dimension, radius error, texture error, perimeter error, area error, smoothness error, compactness error, concavity error, concave points error, symmetry error, fractal dimension error, worst radius worst, texture worst, perimeter worst, area worst smoothness, worst compactness, worst concavity, worst concave points, worst symmetry, worst fractal dimension. The label for each patient was provided as target_name.

Python and R

The two most common programming languages for data analysis and data sciences are still Python and R. The breast cancer study is a perfect case to showcase the basic and differences of the two in term of defining functions, setting up tables or data frame, list, array and iteration to pull the data, analyze the data and put the analysis result (prediction) in a simple format.

As this article is aimed at a beginner the machine learning part of K-nearest neighbors classifier from Scikit Learn or such will not be elaborated. Instead, it simply brings a Pythagorean theorem alike like below to make a prediction.

Pythagorean Theorem and Euclidean Distance

In mathematics, the Euclidean distance between two points in Euclidean space is the length of a line segment between the two points. It can be calculated from the Cartesian coordinates of the points using the Pythagorean theorem, therefore occasionally being called the Pythagorean distance.

One dimension

The distance between any two points on the real line is the absolute value of the numerical difference of their coordinates. Thus if p and q are two points on the real line, then the distance between them is given by

Two dimensions

In the Euclidean plane, let point p have Cartesian coordinates (p1,p2) and let point q have coordinates (q1,q2). Then the distance between p and q is :

Higher dimensions

In three dimensions, for points given by their Cartesian coordinates, the distance is

In general, for points given by Cartesian coordinates in n-dimensional Euclidean space, the distance is

Why these n-dimensional Pythagorean theories matters? In the Breast Cancer case, there are 30 dimensions involved unlike our regular 3D space which only occupies X, Y, and Z axes. The breast cancer prediction has 30 axes. The below representation only show the X and Y axes for an easy explanation.

K-Nearest Neighbors (KNN) for Prediction

To complete the prediction the analysis will be split into three to four stages as below. From the data given by UCI, the first task would be making a function to compute the distance between 2 datasets. Each of them has 30 dimensions.

Secondly, it is needed to define a function to list the distance from a certain patient to all of her neighbors.

Lastly, all the neighbors as listed above would be sorted from the closest one to the furthest but usually, it would be limited to only three nearest neighbors.

The First Function

Below examples are the two different codes from Python and R but gave the same result. Taking points 3 and 4 the distance would be 1407.69 for this case. Python always starts the array from the number 0 and ends with 1 step less. While R always starts from number 1 and ends exactly with the object length. Other things include embedded sqrt in R without necessary import.

Both codes have few steps. First, download the CSV file from AWS S3 which is a duplicate of 1995 data from UCI. After 31 x 569 data was downloaded, it was later split into X (30 column features) and y (1 column of the label) values. Emptied variable squared_difference was initiated and subsequently added by a newly computed value from each of the 30 feature computations. Python iteration uses range in the syntax while R is using seq_along as below.

The Second One

Below code from both Python and R require one argument which is the focused patient. The function finds all the distances of its neighbors. To make a better comparison both codes are tweaked as below to enable a data frame.

In the same function, the patient label was also added.

The Final Codes

Finally, the term K in KNN is now revealed. It explains a number of neighbors (K) to consider when computes the prediction. Out of 569 cases originally filed there are 357 positive cases. KNN approach tries to aggregate the features from a number of patients and predict a probability of malignant based on the group votes (says it would be benign is the group average less than 50% where 0 represents benign patients and 1 for malignant). The below study showed the predicted numbers of positive cases deplete as the K value went higher. The K value should not be too small (it is not an image recognition to match all the features such as fingerprint scan) nor too big.

The wait is over, herewith is the illustration of using both Python and R to compute the prediction. It should be split into different functions but they were put in a single function below to condense this article. The function can return back only one value, it is necessary to enable/disable returns as below.

As you can see Python starts indexing from 0 while R starts from 1. It does not affect the prediction regardless of the language. One more thing I took a sort method with lambda the harder way. Sure in pandas DataFrame there is a simpler way to sort a column. Just trying to show how to use lambda :)

Feel free to play with the code as it is available at https://paiza.io/projects/WqfI-O1leo7Q603ncLOCeg

I prefer to copy and paste the above code to http://makemeanalyst.com/run-your-r-code/ as it allows longer computing time (not like 2 seconds at paiza).