Data Mining Projects: Wine Quality Analysis and Machine Learning
In this post, I’ll share insights from my Data Mining projects, which apply statistical analysis and machine learning techniques to wine quality datasets using the R programming language.
Project Overview
The Data Mining projects showcase a comprehensive analysis of wine quality data using a range of statistical and machine learning techniques. Built with the R programming language and deployed on DigitalOcean infrastructure, they demonstrate real-world data science workflows.
Project Components
Wine Quality Analysis
- Dataset: Comprehensive wine quality dataset with chemical properties
- Techniques: K-means clustering, statistical analysis, data visualization
- Tools: R programming, RStudio Server, statistical packages
- Infrastructure: DigitalOcean cloud server with RStudio Server
Data Mining Techniques
- Clustering: K-means clustering for wine classification
- Statistical Analysis: Descriptive statistics and correlation analysis
- Data Visualization: Advanced plotting and visualization
- Predictive Modeling: Quality prediction models
Technical Architecture
Infrastructure Setup
# DigitalOcean Server Setup
# Ubuntu 20.04 LTS
# 2GB RAM, 1 CPU, 50GB SSD
# Install R and RStudio Server
sudo apt update
sudo apt install r-base r-base-dev
sudo apt install gdebi-core
wget https://download2.rstudio.org/server/bionic/amd64/rstudio-server-1.4.1106-amd64.deb
sudo gdebi rstudio-server-1.4.1106-amd64.deb
# Install additional R packages
sudo R -e "install.packages(c('ggplot2', 'dplyr', 'tidyr', 'cluster', 'factoextra', 'corrplot', 'VIM', 'mice', 'randomForest', 'e1071', 'nnet', 'gridExtra'))"
# Configure RStudio Server
sudo systemctl enable rstudio-server
sudo systemctl start rstudio-server
R Environment Configuration
# R configuration for data mining
# .Rprofile
# Set CRAN mirror
options(repos = c(CRAN = "https://cran.rstudio.com/"))
# Load essential packages
library(ggplot2)
library(dplyr)
library(tidyr)  # provides gather(), used in the EDA code below
library(cluster)
library(factoextra)
library(corrplot)
library(VIM)
library(mice)
# Set working directory
setwd("/home/rstudio/data-mining")
# Configure plotting
theme_set(theme_minimal())
Data Analysis Implementation
Data Loading and Preprocessing
# Load wine quality dataset
wine_data <- read.csv("winequality-red.csv", sep = ";")
# Data structure exploration
str(wine_data)
summary(wine_data)
# Check for missing values
sum(is.na(wine_data))
VIM::aggr(wine_data, col = c('navyblue', 'red'), numbers = TRUE, sortVars = TRUE)
# Data preprocessing
wine_clean <- wine_data %>%
filter(!is.na(quality)) %>%
mutate(
quality_factor = as.factor(quality),
quality_binary = ifelse(quality >= 6, "Good", "Poor")
)
# Feature scaling
wine_scaled <- wine_clean %>%
select(-quality, -quality_factor, -quality_binary) %>%
scale() %>%
as.data.frame()
# Add quality back
wine_scaled$quality <- wine_clean$quality
wine_scaled$quality_binary <- wine_clean$quality_binary
Exploratory Data Analysis
# Descriptive statistics
summary_stats <- wine_clean %>%
select(-quality_factor, -quality_binary) %>%
summary()
# Correlation analysis
correlation_matrix <- cor(wine_clean %>% select(-quality_factor, -quality_binary))
corrplot(correlation_matrix, method = "circle", type = "upper",
order = "hclust", tl.cex = 0.8, tl.col = "black")
# Quality distribution
quality_distribution <- ggplot(wine_clean, aes(x = quality)) +
geom_histogram(binwidth = 1, fill = "steelblue", alpha = 0.7) +
labs(title = "Distribution of Wine Quality Ratings",
x = "Quality Rating",
y = "Frequency") +
theme_minimal()
# Box plots for each feature by quality
feature_plots <- wine_clean %>%
select(-quality_factor, -quality_binary) %>%
gather(key = "feature", value = "value", -quality) %>%
ggplot(aes(x = as.factor(quality), y = value, fill = as.factor(quality))) +
geom_boxplot(alpha = 0.7) +
facet_wrap(~feature, scales = "free_y") +
labs(title = "Feature Distribution by Quality Rating",
x = "Quality Rating",
y = "Value") +
theme_minimal() +
theme(legend.position = "none")
K-means Clustering Implementation
# K-means clustering function
perform_kmeans <- function(data, k_values = 2:8, nstart = 25) {
results <- list()
d <- dist(data)  # compute the distance matrix once, not on every iteration
for (k in k_values) {
# Perform K-means clustering
kmeans_result <- kmeans(data, centers = k, nstart = nstart)
# Calculate silhouette score
sil_score <- silhouette(kmeans_result$cluster, d)
avg_sil_score <- mean(sil_score[, 3])
# Store results
results[[paste0("k", k)]] <- list(
kmeans = kmeans_result,
silhouette = avg_sil_score,
wss = kmeans_result$tot.withinss,
bss = kmeans_result$betweenss
)
}
return(results)
}
# Prepare data for clustering (exclude quality variables)
clustering_data <- wine_scaled %>%
select(-quality, -quality_binary)
# Perform K-means clustering
clustering_results <- perform_kmeans(clustering_data)
# Elbow method for optimal K
wss_values <- sapply(clustering_results, function(x) x$wss)
k_values <- 2:8
elbow_plot <- data.frame(k = k_values, wss = wss_values) %>%
ggplot(aes(x = k, y = wss)) +
geom_line(color = "steelblue", size = 1) +
geom_point(color = "red", size = 3) +
labs(title = "Elbow Method for Optimal K",
x = "Number of Clusters (K)",
y = "Within-Cluster Sum of Squares") +
theme_minimal()
# Silhouette analysis
silhouette_scores <- sapply(clustering_results, function(x) x$silhouette)
silhouette_plot <- data.frame(k = k_values, silhouette = silhouette_scores) %>%
ggplot(aes(x = k, y = silhouette)) +
geom_line(color = "steelblue", size = 1) +
geom_point(color = "red", size = 3) +
labs(title = "Silhouette Analysis for Optimal K",
x = "Number of Clusters (K)",
y = "Average Silhouette Score") +
theme_minimal()
# Optimal K selection
optimal_k <- k_values[which.max(silhouette_scores)]
cat("Optimal number of clusters:", optimal_k, "\n")
Advanced Clustering Analysis
# Perform clustering with optimal K
optimal_clustering <- clustering_results[[paste0("k", optimal_k)]]
wine_clusters <- optimal_clustering$kmeans$cluster
# Add cluster information to dataset
wine_analysis <- wine_clean %>%
mutate(cluster = as.factor(wine_clusters))
# Cluster characteristics
cluster_summary <- wine_analysis %>%
group_by(cluster) %>%
summarise(
count = n(),
avg_quality = mean(quality),
avg_alcohol = mean(alcohol),
avg_volatile_acidity = mean(volatile.acidity),
avg_citric_acid = mean(citric.acid),
avg_residual_sugar = mean(residual.sugar),
avg_chlorides = mean(chlorides),
avg_free_sulfur_dioxide = mean(free.sulfur.dioxide),
avg_total_sulfur_dioxide = mean(total.sulfur.dioxide),
avg_density = mean(density),
avg_pH = mean(pH),
avg_sulphates = mean(sulphates)
)
# Visualize cluster characteristics
cluster_plot <- wine_analysis %>%
select(cluster, alcohol, volatile.acidity, citric.acid, quality) %>%
gather(key = "feature", value = "value", -cluster) %>%
ggplot(aes(x = cluster, y = value, fill = cluster)) +
geom_boxplot(alpha = 0.7) +
facet_wrap(~feature, scales = "free_y") +
labs(title = "Cluster Characteristics",
x = "Cluster",
y = "Value") +
theme_minimal() +
theme(legend.position = "none")
# PCA for dimensionality reduction and visualization
pca_result <- prcomp(clustering_data, scale = TRUE)
pca_data <- data.frame(
PC1 = pca_result$x[, 1],
PC2 = pca_result$x[, 2],
cluster = as.factor(wine_clusters),
quality = wine_clean$quality
)
# PCA visualization
pca_plot <- ggplot(pca_data, aes(x = PC1, y = PC2, color = cluster)) +
geom_point(alpha = 0.7, size = 2) +
labs(title = "PCA Visualization of Wine Clusters",
x = paste0("PC1 (", round(summary(pca_result)$importance[2, 1] * 100, 1), "%)"),
y = paste0("PC2 (", round(summary(pca_result)$importance[2, 2] * 100, 1), "%)")) +
theme_minimal()
Statistical Analysis
# ANOVA analysis for cluster differences
anova_results <- list()
for (feature in names(clustering_data)) {
formula <- as.formula(paste(feature, "~ cluster"))
anova_result <- aov(formula, data = wine_analysis)
anova_results[[feature]] <- summary(anova_result)
}
# Post-hoc analysis (Tukey's HSD)
posthoc_results <- list()
for (feature in names(clustering_data)) {
formula <- as.formula(paste(feature, "~ cluster"))
aov_result <- aov(formula, data = wine_analysis)
tukey_result <- TukeyHSD(aov_result)
posthoc_results[[feature]] <- tukey_result
}
# Quality prediction model
library(randomForest)
# Prepare data for modeling
model_data <- wine_clean %>%
select(-quality_factor, -quality_binary)  # wine_clean has no cluster column; that lives in wine_analysis
# Split data
set.seed(123)
train_indices <- sample(1:nrow(model_data), 0.7 * nrow(model_data))
train_data <- model_data[train_indices, ]
test_data <- model_data[-train_indices, ]
# Random Forest model
rf_model <- randomForest(quality ~ ., data = train_data, ntree = 100)
# Model evaluation
predictions <- predict(rf_model, test_data)
rmse <- sqrt(mean((test_data$quality - predictions)^2))
mae <- mean(abs(test_data$quality - predictions))
cat("Random Forest Model Performance:\n")
cat("RMSE:", rmse, "\n")
cat("MAE:", mae, "\n")
# Feature importance
importance_df <- data.frame(
feature = rownames(rf_model$importance),
importance = rf_model$importance[, 1]
) %>%
arrange(desc(importance))
importance_plot <- ggplot(importance_df, aes(x = reorder(feature, importance), y = importance)) +
geom_bar(stat = "identity", fill = "steelblue", alpha = 0.7) +
coord_flip() +
labs(title = "Feature Importance in Quality Prediction",
x = "Feature",
y = "Importance") +
theme_minimal()
Data Visualization Dashboard
# Create comprehensive visualization dashboard
create_dashboard <- function() {
# Quality distribution
p1 <- ggplot(wine_clean, aes(x = quality)) +
geom_histogram(binwidth = 1, fill = "steelblue", alpha = 0.7) +
labs(title = "Wine Quality Distribution", x = "Quality", y = "Count")
# Alcohol vs Quality
p2 <- ggplot(wine_clean, aes(x = alcohol, y = quality)) +
geom_point(alpha = 0.6, color = "steelblue") +
geom_smooth(method = "lm", color = "red") +
labs(title = "Alcohol Content vs Quality", x = "Alcohol %", y = "Quality")
# Cluster visualization
p3 <- ggplot(pca_data, aes(x = PC1, y = PC2, color = cluster)) +
geom_point(alpha = 0.7) +
labs(title = "Cluster Visualization (PCA)", x = "PC1", y = "PC2")
# Feature importance
p4 <- ggplot(importance_df[1:10, ], aes(x = reorder(feature, importance), y = importance)) +
geom_bar(stat = "identity", fill = "steelblue", alpha = 0.7) +
coord_flip() +
labs(title = "Top 10 Feature Importance", x = "Feature", y = "Importance")
# Combine plots
library(gridExtra)
dashboard <- grid.arrange(p1, p2, p3, p4, ncol = 2, nrow = 2)
return(dashboard)
}
# Generate dashboard
wine_dashboard <- create_dashboard()
Advanced Analytics
Time Series Analysis
# If temporal data is available
if ("year" %in% names(wine_data)) {
# Time series analysis
yearly_quality <- wine_clean %>%
group_by(year) %>%
summarise(
avg_quality = mean(quality),
count = n(),
.groups = 'drop'
)
time_series_plot <- ggplot(yearly_quality, aes(x = year, y = avg_quality)) +
geom_line(color = "steelblue", size = 1) +
geom_point(color = "red", size = 3) +
labs(title = "Wine Quality Trends Over Time",
x = "Year",
y = "Average Quality") +
theme_minimal()
}
Advanced Machine Learning
# Support Vector Machine
library(e1071)
# Prepare data for SVM
svm_data <- wine_clean %>%
mutate(quality_binary = as.factor(ifelse(quality >= 6, "Good", "Poor"))) %>%
select(-quality, -quality_factor)
# SVM model
svm_model <- svm(quality_binary ~ ., data = svm_data, kernel = "radial")
# Model evaluation (on the training data, so this is an optimistic in-sample accuracy)
svm_predictions <- predict(svm_model, svm_data)
svm_accuracy <- mean(svm_predictions == svm_data$quality_binary)
cat("SVM Model Accuracy:", svm_accuracy, "\n")
# Neural Network
library(nnet)
# Prepare data for neural network
nn_data <- wine_clean %>%
select(-quality_factor, -quality_binary)
# Neural network model (inputs are unscaled here; scaling them usually helps nnet converge)
nn_model <- nnet(quality ~ ., data = nn_data, size = 10, linout = TRUE)
# Model evaluation (in-sample; use a train/test split for an unbiased estimate)
nn_predictions <- predict(nn_model, nn_data)
nn_rmse <- sqrt(mean((nn_data$quality - nn_predictions)^2))
cat("Neural Network RMSE:", nn_rmse, "\n")
Infrastructure Management
RStudio Server Configuration
# RStudio Server configuration
# /etc/rstudio/rserver.conf
www-port=8787
www-address=0.0.0.0
auth-minimum-user-id=1000
auth-required-user-group=rstudio-users
# Enable authentication
auth-pam-helper-path=/usr/lib/rstudio-server/bin/rstudio-pam
# Configure session timeout (session settings typically belong in /etc/rstudio/rsession.conf)
session-timeout-minutes=120
session-max-processes=10
Data Backup and Version Control
#!/bin/bash
# backup_data.sh - automated backup script (the shebang must be the first line)
BACKUP_DIR="/home/rstudio/backups"
DATA_DIR="/home/rstudio/data-mining"
DATE=$(date +%Y%m%d_%H%M%S)
# Create backup directory
mkdir -p "$BACKUP_DIR"
# Backup data files
tar -czf "$BACKUP_DIR/wine_data_$DATE.tar.gz" "$DATA_DIR"/*.csv
# Backup R scripts
tar -czf "$BACKUP_DIR/r_scripts_$DATE.tar.gz" "$DATA_DIR"/*.R
# Clean old backups (keep last 7 days)
find "$BACKUP_DIR" -name "*.tar.gz" -mtime +7 -delete
echo "Backup completed: $DATE"
Performance Monitoring
# Performance monitoring script
monitor_performance <- function() {
# System information
system_info <- Sys.info()
# Memory usage (memory.size() is Windows-only; gc() works on Linux servers too)
memory_info <- sum(gc()[, 2])  # megabytes currently in use (Ncells + Vcells)
# R session info
session_info <- sessionInfo()
# Performance metrics
performance_data <- data.frame(
timestamp = Sys.time(),
memory_usage = memory_info,
r_version = R.version.string,
platform = system_info["sysname"]
)
# Log performance data (write.csv() ignores append = TRUE; use write.table())
write.table(performance_data, "performance_log.csv", append = TRUE,
sep = ",", row.names = FALSE,
col.names = !file.exists("performance_log.csv"))
return(performance_data)
}
# Run performance monitoring
performance_log <- monitor_performance()
Results and Insights
Key Findings
- Quality Distribution: Most wines fall in the 5-6 quality range
- Alcohol Content: Higher alcohol content correlates with better quality
- Acidity Levels: Balanced acidity is crucial for wine quality
- Clustering: K-means identified distinct wine categories
- Predictive Models: Random Forest achieved good prediction accuracy
Statistical Significance
- ANOVA Results: Significant differences between clusters for most features
- Correlation Analysis: Strong correlations between certain chemical properties
- Model Performance: RMSE < 0.7 for quality prediction models
Business Implications
- Quality Control: Chemical properties can predict wine quality
- Production Optimization: Focus on key quality indicators
- Market Segmentation: Different wine categories for different markets
Lessons Learned
Data Science Workflow
- Data Preprocessing: Critical for accurate analysis
- Exploratory Analysis: Essential for understanding data patterns
- Model Selection: Different algorithms for different objectives
- Validation: Proper validation prevents overfitting
R Programming
- Package Management: Efficient use of R packages
- Data Manipulation: dplyr for efficient data processing
- Visualization: ggplot2 for publication-quality plots
- Reproducibility: Set seeds for consistent results
Infrastructure Management
- Cloud Computing: DigitalOcean provides reliable R environment
- Resource Management: Monitor memory and CPU usage
- Backup Strategy: Regular backups prevent data loss
- Security: Proper authentication and access control
Future Enhancements
Advanced Analytics
- Deep Learning: Neural networks for complex pattern recognition
- Time Series: Advanced time series analysis for temporal data
- Text Mining: Analysis of wine reviews and descriptions
- Ensemble Methods: Combining multiple models for better predictions
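As a first step toward the ensemble idea, a weighted average of the two regression models built earlier could be sketched as follows. This is a minimal illustration, not part of the original analysis: the 0.7/0.3 weights are arbitrary placeholders that would normally be tuned on a validation set, and `rf_model`, `nn_model`, and `test_data` are the objects from the modeling sections above.

```r
# Hypothetical weighted-average ensemble of the Random Forest and
# neural network regressors. Weights are illustrative, not tuned.
ensemble_predict <- function(rf_model, nn_model, newdata,
                             w_rf = 0.7, w_nn = 0.3) {
  rf_pred <- predict(rf_model, newdata)
  nn_pred <- as.numeric(predict(nn_model, newdata))  # nnet returns a 1-column matrix
  w_rf * rf_pred + w_nn * nn_pred
}

# Evaluate on the held-out test set from the Random Forest section
ensemble_pred <- ensemble_predict(rf_model, nn_model, test_data)
ensemble_rmse <- sqrt(mean((test_data$quality - ensemble_pred)^2))
cat("Ensemble RMSE:", ensemble_rmse, "\n")
```

A simple average like this often beats either base model when their errors are only weakly correlated, which is the intuition behind more elaborate stacking approaches.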
Infrastructure Improvements
- Containerization: Docker for consistent environments
- Scalability: Kubernetes for large-scale processing
- Real-time Processing: Stream processing for live data
- API Development: RESTful APIs for model deployment
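For the containerization idea, a minimal Dockerfile sketch is shown below. The `rocker/rstudio` base image and the pinned version are assumptions (any R-enabled image would do); the package list mirrors the one installed on the DigitalOcean server.

```dockerfile
# Sketch: RStudio Server with the project's packages baked in,
# based on the community rocker/rstudio image (an assumption).
FROM rocker/rstudio:4.2.0

RUN R -e "install.packages(c('ggplot2', 'dplyr', 'tidyr', 'cluster', \
    'factoextra', 'corrplot', 'VIM', 'mice', 'randomForest', 'e1071', \
    'nnet', 'gridExtra'))"

# Copy the project into the default RStudio home directory
COPY . /home/rstudio/data-mining
```

Building with `docker build -t wine-mining .` and running with `docker run -p 8787:8787 wine-mining` would give every collaborator the same environment the analysis was developed in.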
Conclusion
The Data Mining projects demonstrate comprehensive analysis of wine quality data using advanced statistical and machine learning techniques. Key achievements include:
- Statistical Analysis: Comprehensive exploratory data analysis
- Machine Learning: K-means clustering and predictive modeling
- Data Visualization: Publication-quality visualizations
- Infrastructure: Cloud-based R environment on DigitalOcean
- Reproducibility: Well-documented and reproducible analysis
The projects are available on GitHub and showcase practical data science workflows in the R programming language.
These projects represent my exploration into data mining and machine learning techniques, demonstrating how statistical analysis can provide valuable insights into complex datasets. The lessons learned here continue to influence my approach to data analysis and machine learning applications.