Non-Negotiables for Your Data

Essential
Tutorial
Author

Giorgio Luciano

Published

September 24, 2024

Non-Negotiables for Your Data

In today’s data-driven world, it is essential to understand the difference between a spreadsheet and a database, as well as the importance of proper data management. This post will explain why Excel should not be used as a database and provide key guidelines for effectively managing and analyzing data.

Why Excel Is Not a Database

While Excel is a powerful tool for many tasks, it’s important to recognize its limitations when it comes to data management. Here are some key reasons why:

  1. Limited Data Integrity and Validation
    Excel allows users to input any type of data in any cell without strict consistency checks. Databases, on the other hand, enforce data types and restrict entries based on predefined rules.

  2. Lack of Robust Security Features
    Databases provide more advanced access control, including roles, permissions, and activity logs. In Excel, anyone with file access can modify the data without any trace.

  3. Difficulty Handling Large Datasets
    As data grows (millions of rows or hundreds of columns), Excel becomes inefficient and prone to crashing. Databases are designed to handle large datasets efficiently.

  4. Limited Multi-user Collaboration
    Although Excel offers sharing features, version conflicts can easily arise. Databases are built for simultaneous collaboration without data loss.

  5. Lack of Advanced Querying and Reporting Tools
    SQL queries in databases allow for far more complex analyses than Excel can handle. Additionally, databases offer advanced reporting features with Business Intelligence (BI) tools.

For serious data management needs, it’s advisable to use a dedicated database system like MySQL, PostgreSQL, or MongoDB.

Key Guidelines for Data Management

When someone provides data for analysis, it’s important to adhere to certain standards to ensure accurate and efficient analysis.

1. File Naming Conventions

File names should be clear, descriptive, and free of special characters. Using meaningful names is essential to ensure that anyone can understand the content of the file without having to open it. Avoid vague or generic names like “database_latest_version” or “final_file”, which provide little context.

Here are some best practices for naming files:

  • Good examples:

    • “yyyymmdd_trial_name_author_revision”

    • “hospitalXYZ_patient_data.csv”

  • Examples to avoid:

    • “final_final_version_file.xlsx”

    • “data123.xls”*

Avoid special characters like spaces or symbols that can cause issues in different systems and software. Characters to avoid include:

  • /, \, *, &, #, %, @

Instead, use underscores (_) to separate words (e.g., “correct_file_name.xlsx”).

2. Column and Row Naming Conventions

Just like file names, column and row names should follow the same rules of clarity and consistency. Meaningful column names help everyone immediately understand the data without needing extra explanations.

Here are some guidelines for naming columns and rows:

  • Use meaningful names:
    Columns should clearly describe the data they contain. Instead of using generic names like “var1” or “column1”, use more specific names like “patient_age” or “quantity_sold”.

  • Good examples:

    • “patient_name”, “Vitamin D” , “Cholesterol”
  • Examples to avoid:

    • “column1”, “data_A”, “x1”
  • Avoid special characters:
    Like file names, avoid characters like spaces, dashes, or special symbols (“/”, “&”, “#”, etc.). Use underscores (_) or camelCase (e.g., “customerName”) to separate words in column names.

  1. Case Sensitivity
    Be intentional with the use of uppercase and lowercase letters to differentiate categories, data types, or names. Case sensitivity can be important, especially when working with case-sensitive databases or programming languages.

  2. Empty Cells
    An empty cell has meaning. Is it an unmeasured value? An out-of-scale value? A missing value? Or data that cannot be retrieved? Always clarify the meaning of empty cells to avoid ambiguity.

  3. Zero vs. Missing Data
    Zero is different from missing data. Ensure that this distinction is clear in the dataset, perhaps by using a specific value (e.g., “NA” or “null”) to indicate missing data.

  4. Consistent Numeric Notation
    Commas and periods have different meanings in numerical notation depending on the region (e.g., 1,000 in the US means “one thousand,” whereas in Europe it may mean “one”). Use consistent and clear numeric notation throughout the document.

  5. Avoid Duplicates
    There should be no duplicate rows or columns unless there is a specific and justified reason for their presence. Duplicates can skew analysis results.

Following these guidelines will ensure your data is clean, consistent, and ready for accurate analysis. Remember, good data management practices are the foundation for reliable research and decision-making.