Data Foundations - I

What is Data?

Data is raw facts, numbers, or symbols (such as text, images, or measurements) collected from the real world. It becomes useful when organized or analyzed to solve problems, make decisions, or train systems like machine learning models. Data is the backbone of machine learning and refers to any form of information that can be used for a specific purpose. It can be collected from a variety of sources, including finance, sports, medicine, the web, or even music. Depending on the use case, data can exist in different formats, such as tabular data (structured rows and columns), text, images, audio, or other digital objects.


Types of Data

Here’s a breakdown of types of data from different perspectives (statistical, structural, and domain-specific):

1. By Structure/Format

Data can be categorized based on how it is organized and stored. Structured data follows a fixed schema of rows and columns, as in relational databases and spreadsheets. Semi-structured data carries some organizational markers without a rigid schema, as in JSON, XML, or log files. Unstructured data has no predefined format, as with free text, images, audio, and video.

2. By Nature (Statistical Perspective)

From a statistical perspective, data can be classified based on its characteristics and the way it is measured. Qualitative (categorical) data describes qualities and is either nominal (unordered categories, such as colors) or ordinal (ordered categories, such as satisfaction ratings). Quantitative (numerical) data measures quantities and is either discrete (countable values, such as the number of purchases) or continuous (values on a scale, such as temperature).

3. By Machine Learning Relevance

a. Labeled Data

Labeled data consists of observations where each entry (or row in a dataset) is paired with an output label. These labels classify the data into specific categories or classes. For example, in a dataset of emails, each email could be labeled as either "spam" or "not spam": the email text is the input, and the spam/not-spam tag is the output the model should learn to predict.

Labeled data is essential for supervised learning tasks, where the model learns to map inputs to outputs using these predefined labels.
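
A minimal sketch of what such a dataset might look like in pandas (the example emails and column names here are invented for illustration):

```python
import pandas as pd

# Each row pairs an input (the email text) with an output label.
# The texts and labels are made-up examples for illustration.
emails = pd.DataFrame({
    "text": [
        "Win a free prize now!",
        "Meeting rescheduled to 3pm",
        "Claim your reward today",
    ],
    "label": ["spam", "not spam", "spam"],
})

X = emails["text"]   # inputs (features)
y = emails["label"]  # outputs (labels) used in supervised learning
print(emails)
```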

b. Unlabeled Data

Unlabeled data consists of observations without associated output labels. For example, a dataset of customer purchasing histories may contain information about the products bought but no labels indicating customer segments. Unlabeled data is often used in unsupervised learning, where the goal is to find patterns or group similar observations without predefined categories.
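
Unsupervised algorithms can group such data directly. Below is a small sketch using scikit-learn's KMeans on made-up purchase records; the feature values and the choice of two clusters are assumptions for illustration:

```python
import numpy as np
from sklearn.cluster import KMeans

# Unlabeled observations: [annual spend, number of purchases].
# No output labels are provided; the algorithm finds groups itself.
purchases = np.array([
    [200.0, 3], [220.0, 4], [205.0, 2],    # low-spend shoppers
    [950.0, 25], [980.0, 30], [900.0, 27], # high-spend shoppers
])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(purchases)
print(kmeans.labels_)  # cluster assignment discovered for each observation
```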

c. Semi-Labeled Data

Semi-labeled data lies between labeled and unlabeled data. It includes a small portion of labeled observations alongside a large amount of unlabeled data. For instance, a dataset of images might have labels for a few images (e.g., "cat," "dog") while the rest remain unlabeled. This type of data is commonly used in semi-supervised learning, which helps models leverage both labeled and unlabeled data for improved performance.
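
scikit-learn's semi-supervised estimators accept exactly this mix: unlabeled points are marked with -1, and an algorithm such as LabelPropagation spreads the known labels to them. A toy sketch, with points and labels invented for illustration:

```python
import numpy as np
from sklearn.semi_supervised import LabelPropagation

# Six 2-D points; only two carry labels (0 = "cat", 1 = "dog").
# The rest are marked -1, scikit-learn's convention for "unlabeled".
X = np.array([[1.0, 1.0], [1.2, 0.9], [0.8, 1.1],
              [5.0, 5.0], [5.2, 4.9], [4.8, 5.1]])
y = np.array([0, -1, -1, 1, -1, -1])

model = LabelPropagation().fit(X, y)
print(model.transduction_)  # labels inferred for all points, including unlabeled
```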

4. By Time Dependency

Data can also be categorized based on whether it changes over time or remains constant. Static data stays fixed once recorded, such as a customer's date of birth, while dynamic data changes over time and is often captured as time series or continuous streams, such as stock prices or sensor readings.

5. By Domain/Application

Data can be categorized based on its field of use, with different industries handling specific types for various purposes. For example, healthcare works with patient records and medical images, finance with transactions and market feeds, and e-commerce with clickstreams and purchase histories.

6. By Scale (Big Data Perspective)

In the context of big data, data is categorized based on three key characteristics, often referred to as the "3 Vs": Volume (the sheer amount of data), Velocity (the speed at which data is generated and must be processed), and Variety (the range of data formats and sources).

These three factors define the challenges and opportunities in managing big data effectively.


Why is Data Used in Machine Learning?

Data is the foundation of machine learning (ML). Without data, machines cannot learn patterns, make predictions, or improve over time. In simple terms, ML is all about teaching computers to recognize trends, just like humans learn from experience. The more data an ML model has, the better it can understand and make decisions.

When we train an ML model, we provide it with training data, which includes examples with known answers. For instance, if we want to create a model that can recognize cats in pictures, we give it thousands of images labeled as "cat" or "not a cat." The model studies these images and learns to identify features, like shapes, colors, and patterns, that distinguish cats from other objects.

Data in ML comes in different forms, such as numbers, text, images, and sounds. A weather prediction model might use numerical data like temperature and humidity, while a speech recognition system uses audio recordings. The quality and quantity of data directly affect how well an ML model performs. If the data is incorrect, incomplete, or biased, the model may make poor predictions.

Another key reason data is essential is that ML models improve over time by learning from new data. For example, recommendation systems, like those used by Netflix or YouTube, analyze user preferences and constantly update suggestions based on new viewing habits. More data leads to better accuracy and personalized experiences.

In summary, data is what allows machine learning models to "learn" and make smart decisions. The more diverse, accurate, and relevant the data, the better the model will perform.


How to Collect Data for Machine Learning

Collecting data is the first and most crucial step in building a machine learning model. The quality and relevance of the data directly impact how well the model will perform. Data collection methods can be broadly classified into three types: Manual Collection, Using Pre-Existing Datasets, and Advanced Data Collection Techniques.

1. Manual Collection – Creating a Dataset from Scratch

Manual data collection involves gathering and creating a dataset from scratch. This is necessary when there is no existing dataset available or when specific data points are required for a project. Before collecting data, it is essential to define the requirements, such as the type of data needed, the amount required, the format in which it will be stored, and how observations will be labeled.

Once the requirements are clear, the next step is to identify sources from which data can be obtained. The collection process varies based on the data type.

If the data needs to be extracted from websites, web scraping techniques can be used. Libraries such as BeautifulSoup (for parsing HTML), Scrapy (for large-scale scraping), and Selenium (for scraping dynamic websites) are commonly used. Another option is using APIs (Application Programming Interfaces), which allow software applications to securely access and exchange data. APIs like Twitter API (for social media data), OpenWeather API (for weather data), and Google Maps API (for location-based data) provide structured data from external sources.
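
As a minimal scraping sketch with requests and BeautifulSoup (the URL and the tag structure are placeholders; always check a site's terms of service and robots.txt before scraping):

```python
import requests
from bs4 import BeautifulSoup

# Placeholder URL: substitute a page you are permitted to scrape.
url = "https://example.com/articles"

response = requests.get(url, timeout=10)
response.raise_for_status()  # stop early on HTTP errors

soup = BeautifulSoup(response.text, "html.parser")

# Extract the text of every <h2> heading; the tag choice is an
# assumption about the page layout for this sketch.
headlines = [h2.get_text(strip=True) for h2 in soup.find_all("h2")]
print(headlines)
```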

For image data, the collection process might involve manually taking photographs or downloading images from verified sources. It is important to store images with proper labels to maintain an organized dataset. If the data involves power consumption records, one can create an Excel sheet, define the required columns, and manually note observations at regular intervals. Alternatively, IoT (Internet of Things) devices and edge computing can be used to automate data collection.
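
For manual logging, a plain CSV file works as well as a spreadsheet. The sketch below appends timestamped power readings; the file name and column layout are assumptions:

```python
import csv
from datetime import datetime

# Hypothetical reading; in practice this would come from a meter or sensor.
reading_kwh = 1.42

with open("power_log.csv", "a", newline="") as f:
    writer = csv.writer(f)
    # One row per observation: timestamp plus the measured value.
    writer.writerow([datetime.now().isoformat(), reading_kwh])
```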

In large projects, data may be collected by multiple people from different sources. This raw data often needs to be combined and cleaned before use. The process of merging and standardizing data from multiple sources is known as Data Integration, which will be discussed in a later section.

Additionally, log files and system records are valuable sources of data. These logs, generated by software applications, mobile apps, or cloud platforms, contain historical records of activities and can be analyzed for insights.

2. Using Pre-Existing Public Datasets

Instead of collecting data manually, sometimes it is more efficient to use publicly available datasets. Many organizations and research institutions release open-source datasets that can be directly used for machine learning projects. Some popular platforms include Kaggle, the UCI Machine Learning Repository, Google Dataset Search, Hugging Face Datasets, and government open-data portals such as data.gov.

Using public datasets saves time and effort, especially when training models for research or learning purposes. However, it is important to ensure the dataset is relevant, up-to-date, and unbiased before using it in an ML project.
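
Many of these platforms let you load a dataset in a line or two. For instance, scikit-learn bundles several small teaching datasets:

```python
from sklearn.datasets import load_iris

# The classic Iris dataset ships with scikit-learn: 150 flower
# measurements with species labels, ready for experimentation.
iris = load_iris()
print(iris.data.shape)             # (150, 4) feature matrix
print(iris.target_names.tolist())  # ['setosa', 'versicolor', 'virginica']
```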

3. Advanced Data Collection Techniques

For large-scale and complex applications, advanced data collection techniques are used. These methods help in gathering diverse and high-quality datasets for specialized machine learning tasks.

Crowdsourcing is one method where data is collected from multiple users worldwide. Platforms like Amazon Mechanical Turk (MTurk) and Clickworker allow users to contribute data by completing small tasks, such as labeling images or transcribing audio. Kaggle competitions also encourage data collection and annotation from a large community of data enthusiasts.

In cases where real-world data is limited or difficult to obtain, Synthetic Data Generation is used. This technique involves creating artificial datasets using algorithms. It is commonly used in computer vision (to generate training images), finance (to simulate stock market data), and healthcare (to generate synthetic patient records while maintaining privacy).
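
One simple way to produce synthetic tabular data is scikit-learn's make_classification, which samples an artificial labeled dataset of a chosen shape (the parameter values below are arbitrary):

```python
from sklearn.datasets import make_classification

# Generate 1,000 synthetic observations with 10 features and 2 classes.
X, y = make_classification(
    n_samples=1000,
    n_features=10,
    n_informative=5,   # features that actually carry signal
    n_classes=2,
    random_state=42,   # fixed seed so the output is reproducible
)
print(X.shape, y.shape)  # (1000, 10) (1000,)
```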

Another technique is Data Augmentation, which is used to expand existing datasets by applying transformations. For example, in image recognition tasks, images can be rotated, flipped, or slightly altered to create new training examples. In speech recognition, adding noise to an audio file can help models generalize better.
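
With the Pillow library, basic augmentations such as rotation and flipping each take a single call (the input file name is a placeholder):

```python
from PIL import Image, ImageOps

# Placeholder path: substitute an image from your own dataset.
img = Image.open("cat.jpg")

rotated = img.rotate(15)        # rotate 15 degrees counterclockwise
flipped = ImageOps.mirror(img)  # horizontal flip

# Each transformed copy becomes an extra training example.
rotated.save("cat_rot15.jpg")
flipped.save("cat_flip.jpg")
```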

Edge Computing & Real-Time Streaming is another advanced method where data is collected from smart devices, drones, or self-driving cars. These devices process and send real-time data to ML models for instant decision-making. For example, traffic cameras collect live data to monitor congestion patterns, and smart home assistants continuously analyze voice commands.

Data collection is a critical step in machine learning, and choosing the right method depends on the project’s needs. Manual collection allows for customized datasets, public datasets provide a quick and reliable source, and advanced techniques enable large-scale and real-time data gathering. No matter the method used, ensuring data accuracy, diversity, and proper organization is key to building effective ML models.


Data Integration

Data integration is the process of combining data from multiple sources into a unified and consistent format, making it easier to analyze and use in applications like machine learning, business intelligence, and data analytics. It ensures that data from different origins—such as databases, cloud storage, APIs, and spreadsheets—is structured in a way that allows seamless access and processing.

For example, a company may collect customer data from various sources like website visits, social media interactions, and purchase records. If these datasets are stored in different formats and locations, integrating them allows the company to get a complete view of customer behavior, leading to better decision-making.
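
In its simplest form, integration is a join on a shared key. The sketch below merges two made-up customer tables with pandas; the column names and values are invented for illustration:

```python
import pandas as pd

# Two sources describing the same customers, keyed by customer_id.
web_visits = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "pages_viewed": [12, 5, 30],
})
purchases = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "total_spent": [250.0, 40.0, 610.0],
})

# Merge into one unified view per customer.
customers = web_visits.merge(purchases, on="customer_id", how="inner")
print(customers)
```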

How is Data Integrated?

1. ETL (Extract, Transform, Load)

ETL is a widely used method that processes data in three steps: Extract (pull raw data from source systems such as databases, files, or APIs), Transform (clean, standardize, and reshape it into a common format), and Load (write the prepared data into a target system such as a data warehouse).

🔹 Example: A retail company extracts sales data from different stores, transforms it to match a common format, and loads it into a data warehouse for reporting.
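
A toy version of such a pipeline in pandas might look like this (the file names, the column mismatch, and the SQLite target are all assumptions for the sketch):

```python
import sqlite3
import pandas as pd

# Extract: read raw sales exports from two hypothetical stores.
store_a = pd.read_csv("store_a_sales.csv")
store_b = pd.read_csv("store_b_sales.csv")

# Transform: align column names and combine into one table.
store_b = store_b.rename(columns={"amt": "amount"})  # assumed mismatch
sales = pd.concat([store_a, store_b], ignore_index=True)

# Load: write the unified table into a local "warehouse" (SQLite here).
with sqlite3.connect("warehouse.db") as conn:
    sales.to_sql("sales", conn, if_exists="replace", index=False)
```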

2. APIs (Application Programming Interfaces)

APIs allow real-time data exchange between systems without manual extraction.

🔹 Example: A weather app fetches real-time weather updates using the OpenWeather API, or an e-commerce platform integrates Google Maps API for delivery tracking.
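
A minimal call to OpenWeather's current-weather endpoint might look like the sketch below; you would need your own API key, and the parameters follow OpenWeather's public documentation:

```python
import requests

API_KEY = "YOUR_API_KEY"  # placeholder: obtain a key from openweathermap.org

response = requests.get(
    "https://api.openweathermap.org/data/2.5/weather",
    params={"q": "London", "appid": API_KEY, "units": "metric"},
    timeout=10,
)
response.raise_for_status()

data = response.json()
print(data["main"]["temp"])  # current temperature in Celsius
```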

Benefits: Real-time updates, security, and scalability.

3. Data Warehousing

A data warehouse is a centralized storage system optimized for analytics and business intelligence.

🔹 Example: A company stores financial data in Google BigQuery for performance tracking and forecasting.

Popular Platforms: Google BigQuery, Amazon Redshift, Snowflake.

4. Data Virtualization

This technique provides real-time access to multiple data sources without physically moving the data.

🔹 Example: A business analyst retrieves customer purchase data from different databases without duplication.

Benefits and Trade-offs: Reduces storage costs and provides real-time insights, though queries over very large datasets may be slower.

5. Master Data Management (MDM)

MDM ensures a single, consistent, and accurate version of key business data across multiple systems.

🔹 Example: A company unifies customer records across CRM, sales, and support databases to avoid duplicate or outdated data.

Why Use MDM? Improves data quality, prevents inconsistencies, and ensures compliance with regulations.

Challenges in Data Integration

Despite its benefits, data integration comes with challenges: sources often use inconsistent formats and schemas, records may be duplicated or conflict with one another, missing or outdated values must be resolved, large volumes of data can be slow and costly to move, and sensitive data raises security and compliance concerns.

In the upcoming sections, we will delve further into Data Foundations by exploring Data Distribution, which is a crucial aspect of this subject. Additionally, we will discuss Data Preprocessing, a necessary step to prepare data before feeding it into a Machine Learning model.
