How to do a Data Analytics Project
What do data analysts actually do? How do they approach their work?
If you’re interested in breaking into the data analytics field, a common piece of advice is to create a portfolio of projects using data. But if you have no experience, you might not know how to approach projects. What steps should you follow for your portfolio projects?
And then once you’re on the job, how should you approach projects?
Below is a loose framework that I recommend following.
- Start with a problem
- Identify the right data to use
- Determine the scope of your solution
- Clean and prepare the data
- Explore your data
- Solve your problem
- Summarize
Start with a problem
One mistake I see new folks make is starting with a dataset and diving in with no plan. But if you don’t know what problem you’re going to solve, how do you know what to do? Some folks think once they start digging into the data, it will “tell them something.” But how can it “tell you” anything if you aren’t asking anything? Data is not sentient. If you’re just digging in to dig in — how do you know when to stop?
At work, I always start with a problem or question. Often, it comes from a stakeholder, or it’s a question that I formed out of another project. I make sure I understand the “why”: why do we care about this problem or question? How does it relate to our business or our goals? I ask the stakeholder any clarifying questions: what will the answer enable you to do? Is there a certain timeframe we’re looking at, or a certain segment of our users or data?
Identify the right data to use
Now that I know the problem I’m trying to solve, I can look for the best data source to use. At work, that usually means a data source that we’ve collected. For personal projects, there are many free datasets that you can download.
Before downloading the data, make sure you understand it. Is there any documentation available explaining each variable? If not, is there a subject matter expert you can reach out to? Is one data source enough or do you need to join data together, and if so, do you know what column(s) to join on?
If you’re joining data together, make sure it represents the same thing — the same timeline or locations or users, for example.
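Here is a minimal sketch of how you might sanity-check a join in Python with pandas. The `orders` and `users` tables, their columns, and the `user_id` join key are all made up for illustration; swap in your own data and key.

```python
import pandas as pd

# Hypothetical example data: orders and users that share a "user_id" column.
orders = pd.DataFrame({
    "user_id": [1, 2, 2, 4],
    "order_total": [20.0, 35.5, 12.0, 50.0],
})
users = pd.DataFrame({
    "user_id": [1, 2, 3],
    "signup_year": [2021, 2022, 2022],
})

# validate="many_to_one" raises an error if user_id is duplicated in users,
# which would otherwise silently multiply rows in the joined result.
# indicator=True adds a "_merge" column showing where each row came from.
joined = orders.merge(
    users,
    on="user_id",
    how="left",
    validate="many_to_one",
    indicator=True,
)

# Rows marked "left_only" are orders whose user_id has no match in users.
unmatched = joined[joined["_merge"] == "left_only"]
print(f"{len(unmatched)} of {len(joined)} orders have no matching user")
```

If a lot of rows come back unmatched, that is often a sign the two sources don’t cover the same users, locations, or time period.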
Determine the scope of your solution
Before starting your project, make sure you understand the scope.
- What is the deliverable the stakeholder expects? A dashboard or machine learning model put into production, or just a summary of your findings?
- What questions will you need to have answered before you can consider the project to be “done”?
Depending on the scope, I like to “time box” myself so that I don’t go down too many rabbit holes. Depending on the timeline dictated by my stakeholder, I might give myself until the halfway point to get a first draft solution that I can present for feedback and then make updates as necessary for the final deadline.
Clean and prepare the data
Finally, I can query or download the data and clean it.
If you’re doing your own query, make sure to think about:
- What timeframe you should look at
- Other filters to limit the view of the data to what is relevant
- Any variables that will help you segment or group your data. What is relevant to your question or problem? Examples include gender, age, and other demographic factors, location, customer or user type, and technology type.
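To make those three points concrete, here is a small sketch using Python’s built-in `sqlite3` module with an in-memory database. The `orders` table, its columns, and the filter values are invented for the example; your own query will depend on your database and your question.

```python
import sqlite3

# Build a tiny in-memory table so the example runs on its own.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (order_date TEXT, user_type TEXT, country TEXT, total REAL)"
)
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?, ?)",
    [
        ("2023-12-15", "new", "US", 20.0),
        ("2024-01-10", "new", "US", 30.0),
        ("2024-02-03", "returning", "US", 45.0),
        ("2024-02-20", "returning", "CA", 25.0),
        ("2024-03-05", "new", "US", 10.0),
    ],
)

query = """
    SELECT user_type, COUNT(*) AS n_orders, SUM(total) AS revenue
    FROM orders
    WHERE order_date >= ? AND order_date < ?   -- timeframe
      AND country = ?                          -- other relevant filter
    GROUP BY user_type                         -- segmenting variable
    ORDER BY user_type
"""

# Passing values as parameters keeps the filters easy to change later.
rows = conn.execute(query, ("2024-01-01", "2024-04-01", "US")).fetchall()
for user_type, n_orders, revenue in rows:
    print(user_type, n_orders, revenue)
```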
Once you have your data, it should be cleaned and prepared. Even the cleanest and most perfectly collected data often needs some preparation — creating new variables or aggregating data. And often after doing the next step — exploration — you might find more that needs to be cleaned.
Some basic steps for cleaning and preparation:
- Check for missing values. Decide if you can leave them as NULL, if you should replace them, or if you should drop that row or column.
- Check for incorrect data types. Are your dates actually formatted as dates? Are all your numeric values actually integers or floats?
- Check for distributions of numeric variables. Are there outliers? Decide what to do with them — transform them, drop them, or leave them as is.
- Check your categorical variables — is anything wrong? Any unexpected values? Any misspellings or inconsistencies?
- Check for correlations: how are numeric variables related? If some are highly correlated, is multicollinearity going to be an issue?
- Are there any variables you don’t need? Drop them or create a version 2 of your data without those columns. (I create a new version in case I realize later on that I did need one or more of those columns.)
- Do you need to create any new variables? For example, do you want to create bins for any numeric variables? Create any new calculated metrics based on existing columns?
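Several of the checks above can be sketched in a few lines of pandas. The data, column names, bin edges, and the 1.5 × IQR outlier rule below are assumptions chosen for illustration, not the only reasonable choices.

```python
import pandas as pd

# Hypothetical raw data with typical problems: a missing value, dates and
# numbers stored as text, inconsistent category labels, and an outlier.
raw = pd.DataFrame({
    "signup_date": ["2024-01-05", "2024-01-17", "2024-02-02", "2024-02-20", "2024-03-01"],
    "plan": ["Basic", "basic ", "Premium", "premium", "Basic"],
    "monthly_spend": ["10", "12", None, "11", "500"],
    "internal_note": ["a", "b", "c", "d", "e"],
})

# Work on a new version without the unneeded column; raw stays untouched.
df = raw.drop(columns=["internal_note"]).copy()

# Missing values: count them per column before deciding what to do.
missing_counts = df.isna().sum()

# Data types: convert text to real dates and numbers.
df["signup_date"] = pd.to_datetime(df["signup_date"])
df["monthly_spend"] = pd.to_numeric(df["monthly_spend"])

# Categorical values: normalize whitespace and capitalization.
df["plan"] = df["plan"].str.strip().str.lower()

# Outliers: flag values outside 1.5 * IQR instead of dropping them outright.
q1, q3 = df["monthly_spend"].quantile([0.25, 0.75])
iqr = q3 - q1
df["spend_outlier"] = (
    (df["monthly_spend"] < q1 - 1.5 * iqr) | (df["monthly_spend"] > q3 + 1.5 * iqr)
)

# New variables: bin a numeric column into labeled ranges.
df["spend_band"] = pd.cut(
    df["monthly_spend"],
    bins=[0, 11, 100, float("inf")],
    labels=["low", "mid", "high"],
)

print(missing_counts)
print(df)
```

Flagging outliers in a separate column, rather than deleting them, leaves the decision about what to do with them open until you understand the data better.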
Explore your data
Before getting to the problem you’re trying to solve, familiarize yourself with the data. Some common things to look at:
- Check the distributions for any numeric variables. A histogram is a good way to visually do this.
- Check the count, mean, median, standard deviation, minimum, maximum, and quartiles of your numeric variables. Some packages in Python or R will do this in one line of code.
- Check the count of your categorical variables.
- Compare the numeric values when grouping by categorical variables.
- Check for correlations between your numeric variables, either as a correlation coefficient or visually with scatterplots.
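As a rough sketch, most of these checks are one-liners in pandas. The small dataset and its column names are made up for the example, and plotting (histograms, scatterplots) is left out to keep it short.

```python
import pandas as pd

# Hypothetical cleaned data: one categorical and two numeric variables.
df = pd.DataFrame({
    "plan": ["basic", "basic", "premium", "premium", "basic", "premium"],
    "sessions": [3, 5, 8, 10, 4, 9],
    "monthly_spend": [10.0, 12.0, 30.0, 35.0, 11.0, 32.0],
})

numeric_cols = ["sessions", "monthly_spend"]

# Count, mean, standard deviation, min, max, and quartiles in one call.
summary = df[numeric_cols].describe()

# Count of each category.
plan_counts = df["plan"].value_counts()

# Compare a numeric variable across the groups of a categorical variable.
spend_by_plan = df.groupby("plan")["monthly_spend"].agg(["mean", "median"])

# Pairwise (Pearson) correlations between the numeric variables.
correlations = df[numeric_cols].corr()

print(summary)
print(plan_counts)
print(spend_by_plan)
print(correlations)
```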
If anything “weird” came up in your exploration, it’s fine to do more data cleaning at this point.
It’s easy during data exploration to go down every rabbit hole and analyze the data in many different ways. Sometimes you reach “analysis paralysis” where you have too much information and aren’t sure what conclusions to make. This is why it’s important to have an initial problem or question you’re trying to solve. If you feel overwhelmed by information, take a step back and remember what you’re trying to accomplish and what is most relevant to the stakeholders for this project.
Solve your problem
Now that you have clean data that you are familiar with, solve your initial problem or answer your initial question, via the deliverable that you agreed upon with your stakeholder.
If you agreed on a dashboard and/or your stakeholder is going to need updated data in the future, build a dashboard that is clean and easy for them to navigate. Use headings and labels to make it easy for a viewer to understand what they are looking at. Add filters so they can self-serve different views of the data. Add a link to a document that defines all of the variables in your dashboard.
Otherwise, if it’s a one-time analysis, use whatever tools you think best. I often work in SQL + Python. Sometimes doing all of the exploratory steps above is enough to answer the question. Or I’ll build a predictive model, maybe linear or logistic regression or a tree-based model like Random Forest, and analyze the coefficients (for regression) or feature importances (for tree-based models) to understand how the independent variables impact the dependent variable.
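To illustrate the idea of reading coefficients, here is a minimal sketch that fits a linear regression with plain NumPy least squares, standing in for whatever modeling library you would normally use. The data is simulated, and the variable names (`sessions`, `tenure`, `spend`) and the true relationship are assumptions made up for the example.

```python
import numpy as np

# Simulated data: spend depends on sessions and tenure, plus random noise.
rng = np.random.default_rng(0)
n = 200
sessions = rng.normal(10, 3, n)
tenure = rng.normal(24, 6, n)
spend = 5.0 + 2.0 * sessions + 0.5 * tenure + rng.normal(0, 1, n)

# Standardize the features so the coefficients are on a comparable scale:
# each one is the change in spend per one standard deviation of the feature.
X = np.column_stack([sessions, tenure])
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# Add an intercept column and solve the ordinary least squares problem.
design = np.column_stack([np.ones(n), X_std])
coefs, *_ = np.linalg.lstsq(design, spend, rcond=None)

for name, value in zip(["intercept", "sessions", "tenure"], coefs):
    print(f"{name}: {value:.2f}")
```

Standardizing first is a design choice: without it, a coefficient’s size depends on the units of its variable, so comparing raw coefficients can be misleading.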
Summarize
The amount of “summary” you need to put together depends on the project and the audience.
If you’re doing a dashboard, sometimes all you need to do is share a link to where it can be accessed, although usually you need to walk your stakeholder(s) through what is included in the dashboard, how to use it, how often the data refreshes, etc. And they will likely have questions. It can help to come up with a few use cases, or examples, that you walk them through to demonstrate how they can use the dashboard to answer their ongoing questions.
Often it’s necessary to summarize your work in a PowerPoint slideshow. Even if you are creating a personal project for your portfolio, I recommend summarizing it in PowerPoint, which you can upload to GitHub.
One framework for the summary is the S.T.A.R. method. This is also good for talking about your projects in an interview:
- Situation: What is the business problem you’re trying to solve or question you’re answering? Why is it important?
- Task: What is the goal or intended outcome?
- Actions: What steps did you take? This section might have more or less detail or technical explanations depending on your audience.
- Results: What was the actual outcome? What are your insights? What are your recommendations? How can your audience use this information to improve the business?
For personal projects, I use GitHub to store my data as well as my summary. For work projects, I save them to my team’s private GitHub repository or summarize them on Confluence (which works like a company wiki). This makes it easy to share or refer back to the work in the future (just share a link to the specific repository or page).