This is a guest post by Vitalii Kamynin, an expert with more than 15 years of experience developing software for the digital transformation of the pharmaceutical industry. He is the founder and CTO of OMNI Digital and of Infolek (until 2022), and specializes in innovative products that automate and optimize the business processes of pharmacies and pharmaceutical companies.
In the fast-paced world of AI, training a model is commonplace; making it stand out is the hard part. The real advantage comes from high-quality proprietary data rather than generic public datasets, because an AI model is only as good as the data it is trained on.
However, data is rarely ready for use: it often contains errors, duplicates, missing values, and other defects that lower model accuracy and limit performance. According to a survey conducted by Anaconda, data scientists spend approximately 45% of their time on data preparation tasks such as loading and cleaning data.
Data cleansing and validation improve model accuracy, reduce distortions, and maximize analytics value. This article explores data cleaning principles to enhance AI quality and optimize business decision-making.
What Causes Inaccurate Data?
Inaccurate data from various sources leads to unreliable insights and flawed decisions. Errors can creep in at several stages of processing, and each stage affects information quality and reliability in its own way.

During data entry, human factors such as typos, accidental omissions, or duplication introduce inaccuracies, and inconsistencies in formats, units of measurement, or text encodings cause further discrepancies.

During data collection, errors come from malfunctioning sensors and IoT devices, incorrect automated recognition, failed API requests or web scraping, and duplicate records created by table merges or double-saving.

Incorrect or incomplete data may also arise from unfilled mandatory fields, data lost in transmission due to network failures or corrupted files, insufficient information from sources, intentionally corrupted data from bots or spam entries, or simply outdated information.

Finally, annotation errors, which are particularly critical in AI training, include incorrect labels in classification tasks and human mistakes during manual tagging of images or text.
Why Is Data Cleaning Important?
Data cleaning establishes accuracy, reliability, and efficiency in analytics, decision-making, and machine learning. Dirty data leads to incorrect insights, flawed predictions, and costly mistakes, while clean data improves operational efficiency and enhances collaboration by providing high-quality information.
Data quality directly affects AI and machine learning performance: errors distort predictions and automation results, and in regulated sectors such as finance and healthcare, poor quality also brings compliance risks and financial losses.
By investing in data cleaning, organizations reduce risks, save time, and maximize data value, turning raw information into a strategic asset.
Such incidents are common and can lead to significant financial losses. In one case, a pharmaceutical distributor uploaded a price list to an online marketplace, but a data processing error stripped one digit from every price. The platform prioritized speed, performed no verification, and treated uploaded lists as binding offers. Minutes later, a pharmacy chain placed a massive order at prices five times lower than intended, totaling tens of millions of dollars. Fulfilling the order would have meant immediate bankruptcy for the distributor.
Tools
For each task, there is a tool. Python, with libraries like Pandas and NumPy, covers most general-purpose cleaning, and standalone tools such as OpenRefine round out the toolkit. Pandas efficiently handles tabular data, including missing values and duplicates; NumPy supports numerical computation; and OpenRefine is well suited to fixing inconsistencies in text-based data.
For structured databases, SQL enables direct data cleaning through filtering, updating, and normalization. AI and machine learning help automate data quality tasks such as filling in missing values, detecting anomalies, and categorizing data. Tools like Excel and Google Sheets work well for cleaning and analyzing smaller datasets. Regular expressions (RegEx) are essential for pattern extraction in text and are widely used in Python, SQL, and Bash.
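To make the idea of pattern extraction concrete, here is a minimal Python sketch; the note text and the patterns are invented for illustration only.

```python
import re

# Hypothetical free-text notes from which structured fields must be extracted.
notes = [
    "Order ORD-10482 delivered on 2024-03-15 to branch 7",
    "order ord-10519 delayed, reschedule for 2024-03-18",
]

order_pattern = re.compile(r"ORD-\d+", re.IGNORECASE)  # order identifiers
date_pattern = re.compile(r"\d{4}-\d{2}-\d{2}")        # ISO-style dates

for note in notes:
    order_ids = order_pattern.findall(note)
    dates = date_pattern.findall(note)
    print(order_ids, dates)
```

The same expressions can be reused almost verbatim in SQL (via REGEXP functions) or in grep, which is part of what makes RegEx such a portable skill.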
ETL tools like Apache NiFi, Airflow, and Talend automate large-scale data cleaning and integration. Bash and PowerShell provide powerful text-processing utilities (sed, awk, grep, cut) for managing large datasets, particularly in Unix-based environments.
General Practices in Data Cleaning
Data cleaning goes far beyond just correcting mistakes; it is a systematic process that starts with comprehending the data. Analysts review the dataset’s structure, format, and any inconsistencies before making changes. This initial assessment reveals missing values, duplicates, and formatting problems, enabling a more strategic method for data refinement. Skipping this step could lead to new errors instead of fixing the current ones.
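In Pandas, that first look can be as simple as a few calls; the file name below is a placeholder for whatever raw dataset is under review.

```python
import pandas as pd

# Placeholder file name standing in for the raw dataset being assessed.
df = pd.read_csv("raw_orders.csv")

print(df.shape)                    # overall size
print(df.dtypes)                   # column types (dates stored as strings are a common red flag)
print(df.isna().sum())             # missing values per column
print(df.duplicated().sum())       # exact duplicate rows
print(df.describe(include="all"))  # quick look at ranges and obvious outliers
```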
In collaborative settings or when managing extensive datasets, it’s a big help to work with copies instead of modifying the original. By creating backups or using separate versions, you can avoid irreversible alterations and safeguard the integrity of the original dataset information.
Furthermore, data cleaning is performed with a clear purpose. Professionals consider how the data will be used in analysis, modeling, or reporting. Their cleaning efforts are directed by the end goal, ensuring that transformations align with business or analytical objectives instead of arbitrary data modifications.
Logs and Data Integrity Control
Maintaining a change log is a crucial practice in data cleaning and transformation. Tracking all modifications — including what changes were made, when they occurred, and why — ensures transparency and reproducibility. This documentation becomes particularly valuable when working in teams, allowing colleagues to understand past adjustments and avoid redundant efforts. A well-maintained log also helps identify errors, monitor how the dataset evolves, and demonstrate compliance through a clear audit trail.
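A change log does not require heavy tooling. As a minimal sketch (the step name and row counts are invented), even a plain list of records capturing what changed, when, and why is a good start:

```python
from datetime import datetime, timezone

change_log = []

def log_step(description, rows_before, rows_after, reason):
    """Record what was changed, when it happened, and why."""
    change_log.append({
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "step": description,
        "rows_before": rows_before,
        "rows_after": rows_after,
        "reason": reason,
    })

# Example entry after a hypothetical deduplication step.
log_step("drop exact duplicates", rows_before=10_250, rows_after=10_114,
         reason="exact copies introduced by a double upload")
```

In practice the same records can be written to a file or a database table so the audit trail survives beyond a single session.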
Ensuring data integrity is equally important. Checksums and hashing techniques make it possible to verify that data has not changed unintentionally during transformations. Discrepancies can be caught early by computing and comparing hash values before and after processing, preventing unnoticed alterations or data corruption. This approach is especially valuable when dealing with sensitive information, migrating datasets between systems, or running extensive transformations where accuracy is critical.
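One straightforward way to do this in Python is to hash a file before and after a transfer or migration and compare the digests; the paths below are placeholders.

```python
import hashlib

def file_sha256(path, chunk_size=65536):
    """Compute a SHA-256 checksum of a file without loading it all into memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Placeholder paths: compare the source file and its migrated copy.
before = file_sha256("exports/prices_2024_03.csv")
after = file_sha256("backup/prices_2024_03.csv")
if before != after:
    raise ValueError("Checksum mismatch: the dataset changed during transfer")
```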
Handling Missing Values, Duplicates, Outliers, and Standardization
Effective data cleaning requires a structured approach to addressing missing values, duplicates, outliers, and inconsistent formats. Understanding the nature of missing data is essential, as gaps can be either random or systematic. Depending on the cause, missing values can be filled using the mean, median, or predicted values through machine learning or, if appropriate, removed entirely. However, be cautious with zero values, which may carry significance and should not be treated as missing data. Avoid indiscriminately deleting rows with missing values, as this can result in the loss of valuable information — always evaluate their impact before removal.
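In Pandas, these choices map onto a few simple idioms. A minimal sketch with hypothetical columns, where 'revenue' has genuine gaps and a zero 'discount' is a real value rather than a missing one:

```python
import pandas as pd
import numpy as np

# Hypothetical data: 'revenue' has gaps, 'discount' legitimately uses 0.
df = pd.DataFrame({
    "revenue": [120.0, np.nan, 95.5, 210.0, np.nan],
    "discount": [0.0, 5.0, 0.0, np.nan, 10.0],
})

# Inspect before deciding anything: how much is actually missing?
print(df.isna().mean())

# Median imputation is robust to outliers; the mean or a model-based
# prediction are alternatives, depending on why the values are missing.
df["revenue"] = df["revenue"].fillna(df["revenue"].median())

# A discount of 0 is a real value, so only the genuinely missing entry is filled.
df["discount"] = df["discount"].fillna(0.0)
```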
When dealing with duplicates, verify whether records are truly identical or just similar. Simply removing them without verification may result in losing important details, such as customer contact information in merged datasets. Always ensure that duplicate resolution retains key attributes and maintains data integrity.
Another incident involved duplicate entries in a pharmacy chain's regional directory which, combined with a misconfigured report, duplicated sales figures in certain regions and artificially inflated total sales. As a result, the company appeared to exceed the performance targets of a marketing contract with a pharmaceutical manufacturer, leading to excess bonus payments and, ultimately, disputes, lengthy negotiations, financial losses, and reputational damage.
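This is why duplicate resolution is worth making explicit rather than relying on a blanket drop. A minimal sketch, assuming a hypothetical customer table where 'customer_id' identifies the entity and no contact detail present in any copy should be lost:

```python
import pandas as pd

# Hypothetical merged customer data with near-duplicate rows for the same customer.
df = pd.DataFrame({
    "customer_id": [101, 101, 102],
    "name": ["City Pharmacy", "City Pharmacy", "HealthPlus"],
    "phone": [None, "+1-555-0100", "+1-555-0199"],
    "email": ["info@citypharmacy.example", None, "sales@healthplus.example"],
})

# Exact duplicates can be dropped outright.
df = df.drop_duplicates()

# For near-duplicates sharing a key, merge rather than discard: take the first
# non-null value per column so contact details present in any copy are kept.
deduplicated = df.groupby("customer_id", as_index=False).first()
print(deduplicated)
```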
Outlier detection is another critical step, as extreme values — such as unrealistic salaries or ages — can distort analysis. To identify anomalies, use statistical methods like interquartile range (IQR) or standard deviation. Avoid automatic removal, as some outliers may reflect genuine events, such as peak sales days or exceptional customer behaviors. Analyze the context before deciding whether to discard or adjust outliers.
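The IQR rule takes only a few lines; the salary figures below are invented, and the point is to flag values for review rather than delete them automatically.

```python
import pandas as pd

# Hypothetical salary data with one suspicious entry.
salaries = pd.Series([42_000, 48_500, 51_000, 47_200, 39_800, 450_000])

q1, q3 = salaries.quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Flag outliers for review instead of dropping them: the 450,000 value
# might be a data entry error, or a genuine executive salary.
outliers = salaries[(salaries < lower) | (salaries > upper)]
print(outliers)
```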
Finally, data standardization ensures consistency across datasets. Regular expressions (RegEx) help clean text fields like phone numbers and addresses. Convert text data to a consistent case to prevent redundant duplicates caused by capitalization differences. Ensure that numbers are stored as numerical values and dates follow a uniform format, particularly when consolidating data from multiple sources. Be mindful of encoding variations and measurement units—mixing currencies, length units (meters vs. feet), or temperature scales (Celsius vs. Fahrenheit) can introduce errors. Trimming unnecessary spaces and special characters improves text processing accuracy. Additionally, categorical data should be standardized by unifying equivalent values (e.g., “New York City,” “NYC,” “New York”). Infrequent categories should be consolidated under an “Other” label when necessary.
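A few of these standardization steps expressed in Pandas; the columns, the category mapping, and the phone format are illustrative assumptions rather than a prescription.

```python
import pandas as pd

df = pd.DataFrame({
    "city": [" New York City", "NYC", "new york ", "Boston"],
    "phone": ["(555) 010-0123", "555.010.0456", "+1 555 010 0789", "5550100999"],
    "signup_date": ["2024-03-15", "03/18/2024", "2024-03-03", "2024-04-01"],
})

# Trim whitespace and normalize case before comparing categorical values.
df["city"] = df["city"].str.strip().str.lower()

# Unify equivalent category labels with an explicit mapping.
city_map = {"nyc": "new york city", "new york": "new york city"}
df["city"] = df["city"].replace(city_map)

# Keep only digits in phone numbers, then reformat downstream as needed.
df["phone"] = df["phone"].str.replace(r"\D", "", regex=True)

# Parse dates in mixed formats into a single datetime type.
# format="mixed" requires pandas 2.0+; with older versions, parse each source separately.
df["signup_date"] = pd.to_datetime(df["signup_date"], format="mixed", errors="coerce")
```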
By methodically tackling these challenges, organizations can boost data quality, increase analytical precision, and improve the performance of machine learning models.
Data Validation: Ensuring Accuracy and Consistency
One of the most effective validation methods is implementing unit tests that catch critical issues early. These tests should cover key aspects such as the presence of required values, correct data formats, and valid numerical ranges.
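Such tests can be a handful of assertions that run before the data enters the pipeline; the column names and rules below are hypothetical.

```python
import pandas as pd

def validate_orders(df: pd.DataFrame) -> None:
    """Fail fast if incoming data violates basic expectations."""
    required = {"order_id", "price", "order_date"}
    missing_cols = required - set(df.columns)
    assert not missing_cols, f"Missing required columns: {missing_cols}"

    assert df["order_id"].notna().all(), "order_id must not be empty"
    assert df["order_id"].is_unique, "order_id must be unique"
    assert (df["price"] > 0).all(), "prices must be positive"
    assert pd.api.types.is_datetime64_any_dtype(df["order_date"]), \
        "order_date must be parsed as a datetime"
```

Running checks like these on every delivery would have caught the price-list and shifted-row incidents described in this article long before they reached a decision-maker.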
A pharmaceutical company spent several months receiving, processing, and analyzing reports from a distributor, unaware that sales data had shifted down by one row. As a result, sales figures were severely distorted, leading to incorrect analyses, flawed conclusions, and miscalculations of employee bonuses.
Data visualization is another powerful validation tool. Creating distribution plots, histograms, or summary tables helps identify outliers, inconsistencies, and potential errors that might go unnoticed in raw tables. Box plots showcase extreme values, whereas scatter plots illustrate unexpected correlations or gaps in data patterns.
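A couple of quick plots are usually enough to surface such problems; here is a small matplotlib sketch on synthetic price data in which a few values have lost a digit.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
# Synthetic prices plus a handful of suspiciously low values (a lost digit).
prices = pd.Series(np.concatenate([rng.normal(50, 8, 500), [5.1, 4.8, 5.0]]))

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.hist(prices, bins=40)        # overall distribution
ax1.set_title("Price distribution")
ax2.boxplot(prices)              # extreme values stand out immediately
ax2.set_title("Price box plot")
plt.tight_layout()
plt.show()
```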
External APIs can be leveraged for verification to further enhance accuracy. Google Maps API helps validate and standardize address fields, while phone number and email validation APIs ensure contact information is correctly formatted. This approach is useful for user-inputted datasets, reducing errors and improving data quality.
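The same kind of validation can also be done locally before calling any paid service. As one illustration (the phonenumbers package is our choice for this sketch, not something the article prescribes), phone fields can be parsed, validated, and normalized to a single canonical format:

```python
# Requires the third-party 'phonenumbers' package (pip install phonenumbers).
import phonenumbers

raw_numbers = ["(212) 555-0148", "555-0199", "+44 20 7946 0958"]

for raw in raw_numbers:
    try:
        parsed = phonenumbers.parse(raw, "US")  # "US" is the assumed default region
    except phonenumbers.NumberParseException:
        print(raw, "-> could not parse")
        continue
    if phonenumbers.is_valid_number(parsed):
        # Store one canonical format so downstream joins and deduplication work.
        print(raw, "->", phonenumbers.format_number(parsed, phonenumbers.PhoneNumberFormat.E164))
    else:
        print(raw, "-> invalid")
```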
For complex datasets, it is best to divide the cleaning process into stages. For example, first address missing values, then handle duplicates, and finally perform formatting—this ensures a structured and efficient workflow. Additionally, it is important to check the relationships between columns. For instance, make sure that the date of birth corresponds to the specified age or that transaction timestamps follow a logical sequence.
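Cross-column checks like these are straightforward to express in Pandas; the columns, the reference date, and the one-year tolerance below are all hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    "date_of_birth": pd.to_datetime(["1985-06-01", "1990-02-11", "2001-09-30"]),
    "reported_age": [39, 25, 23],
    "created_at": pd.to_datetime(["2024-01-10 09:00", "2024-01-10 09:05", "2024-01-09 18:00"]),
    "updated_at": pd.to_datetime(["2024-01-11 10:00", "2024-01-09 08:00", "2024-01-09 19:30"]),
})

# Age implied by date of birth as of a reference date.
reference = pd.Timestamp("2024-06-01")
implied_age = (reference - df["date_of_birth"]).dt.days // 365

# Flag rows where reported age disagrees with the date of birth by more than a year.
age_mismatch = (implied_age - df["reported_age"]).abs() > 1

# Flag rows where timestamps are out of logical order.
time_mismatch = df["updated_at"] < df["created_at"]

print(df[age_mismatch | time_mismatch])
```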
Data cleaning is more than a technical requirement; it is a strategic investment that affects business efficiency, the precision of decision-making, and overall financial outcomes. High-quality data is what makes forecasts more accurate, analytics more reliable, and expensive errors avoidable.