Can you choose different public data sources quickly and efficiently, and use that data for data science too? Finding correlations in public data sources used to take a lot of wrangling.
For most teams, the work described above would require multiple people:
- Data architect – helps you architect your data
- Database admin – helps you access your data
- Data scientist – helps you with predictive analytics
- Developer – migrates data science models into code
- BI Developer – helps you build a dashboard
We always intend to find correlations in public data sources.
Finding correlations in public data sources is a lot like finding a pattern on your wallpaper.
It takes consideration, time, planning, counting, memorizing, anyways…
But now we can automate most of this stuff 😉
Our prediction is that ethnicity, deaths, births, birth fertility, homelessness, and dozens more measures will offer insights into average survey scores.
The more our measures correlate, the better we can predict our future.
Remember before reading the findings that statistics are opinionated. Correlation does not mean causation.
Noticing trends in our data sources.
Here’s a scatter plot displaying HCA survey average scores per state and Total Homelessness in 2014. The points on the graph are individual states. You will notice a negative trend: as the survey scores increase, the number of homeless people decreases.
Scoring correlations in our data sources.
If we score the correlation between the two values across every state, we can determine which metrics follow a similar pattern.
With scores, we can understand which measures correlate the most with the hospital survey scores, and which do not correlate at all.
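To make the scoring idea concrete, here is a minimal sketch that computes a Pearson correlation between survey scores and each measure, then ranks the measures by strength. It assumes Pearson is the score being used, which the original workflow does not state. The function names and the numbers are made up for illustration and are not the real state data.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


def rank_measures(survey_scores, measures):
    """Score every measure against the survey scores, strongest first."""
    scored = {name: pearson(survey_scores, values) for name, values in measures.items()}
    # Sort by absolute value so strong negative correlations rank high too.
    return sorted(scored.items(), key=lambda item: abs(item[1]), reverse=True)


# Hypothetical numbers, one entry per state (NOT the real data).
survey_scores = [88.0, 90.0, 91.0, 93.0, 95.0]
measures = {
    "total_homelessness": [5000, 4200, 4100, 3000, 2500],
    "births": [100, 300, 200, 250, 150],
}

ranking = rank_measures(survey_scores, measures)
for name, r in ranking:
    print(f"{name}: r = {r:+.2f}")
```

A score near -1 or +1 means the measure moves closely with the survey scores, and a score near 0 means it barely moves with them. Either way, the score says nothing about cause.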
There are hundreds of hospitals per state. Here’s a bar chart of the top ten states, showing how many hospitals are being averaged in each.
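As a rough sketch of that averaging step, the snippet below counts hospitals per state, averages their scores, and picks the states with the most hospitals. The row layout, function names, and numbers are assumptions for illustration only.

```python
from collections import defaultdict


def summarize_by_state(hospital_rows):
    """Return {state: (hospital_count, average_score)} from (state, score) rows."""
    totals = defaultdict(lambda: [0, 0.0])  # state -> [count, running sum]
    for state, score in hospital_rows:
        totals[state][0] += 1
        totals[state][1] += score
    return {state: (count, total / count) for state, (count, total) in totals.items()}


def top_states_by_count(summary, n=10):
    """States with the most hospitals feeding the average, largest first."""
    return sorted(summary.items(), key=lambda item: item[1][0], reverse=True)[:n]


# Hypothetical hospital-level rows (NOT the real survey data).
rows = [
    ("TX", 90.0), ("TX", 86.0), ("TX", 88.0),
    ("CA", 91.0), ("CA", 89.0),
    ("VT", 94.0),
]

summary = summarize_by_state(rows)
top = top_states_by_count(summary, n=2)
for state, (count, average) in top:
    print(f"{state}: {count} hospitals, average score {average:.1f}")
```

Keeping the count next to the average matters: a state average built from one hospital deserves less trust than one built from hundreds.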
Mastering Multiple Public Data Sources
If you want to bring multiple data sources together, you need to be very good at SQL, data manipulation, and data architecture, and you need a lot of spare time in a spreadsheet if these sound like new skills. Finding correlations is another beast.
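For a feel of what "bringing sources together" means, here is a small sketch of an inner join on a shared state key, written in plain Python rather than SQL or KNIME. The source names, column names, and values are hypothetical.

```python
def join_on_state(*sources):
    """Inner-join dicts keyed by state: keep only states present in every source."""
    if not sources:
        return {}
    shared = set(sources[0])
    for source in sources[1:]:
        shared &= set(source)
    joined = {}
    for state in sorted(shared):
        row = {}
        for source in sources:
            row.update(source[state])  # merge each source's columns into one row
        joined[state] = row
    return joined


# Hypothetical state-level sources (NOT the real data).
survey = {
    "TX": {"avg_survey_score": 88.0},
    "CA": {"avg_survey_score": 90.0},
    "VT": {"avg_survey_score": 94.0},
}
homelessness = {
    "TX": {"total_homelessness": 5000},
    "CA": {"total_homelessness": 4200},
}
births = {
    "TX": {"births": 100},
    "CA": {"births": 300},
    "NY": {"births": 250},
}

joined = join_on_state(survey, homelessness, births)
for state, row in joined.items():
    print(state, row)
```

States missing from any one source drop out of the result, which is exactly the kind of quiet data loss worth checking for when sources are combined.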
If the data scientist in your office is busy or unavailable, you are in for a deep dive into R, Python, statistics, or Excel martial arts.
Luckily, we KNIME our data, which means we can do all the complex SQL, data manipulation, and data architecture, then pipe the result through a data science model, all in the same window.
What would have taken my group multiple weeks of constant work, only took us 1.5 hours!
Looking for a public data source?
Public data sources have become readily available in the past few years, and being able to quickly identify powerful ones has become a distinguishing skill to master.
If you are interested in picking up this skill, read our solution below.
I stumbled on this request in a team project last week!
“Everyone, pick multiple data sources…”
This can sound like an easy request…
- Where do you get your data?
- What data is accurate?
- What source of data can we trust?
- What decisions are we attempting to make?
We want to find data sources that add value for our audience, so that their questions have answers and their next steps are clear.
In this blog, you will learn how…
You will learn how our team quickly accomplished our project using one simple whiteboarding methodology.
You’re probably thinking, ‘Yes, of course, I can pick a data source, I’m passionate about x, y, z.’
The more you grow in the data industry, the more you realize every new data source comes with new hurdles or barriers to entry.
Is the data clean? Is the data usable? Do we have access to the data?
You begin padding your work based on previous experiences, so you can avoid repeating the painful ones.
If you choose to read this entirely, you will be armed with a solution to any brainstorming meeting, and you will learn where we found 3 different data sources online.
That’s right, I’m talking about your future in data.
Eventually, you’re going to hit that ten-year threshold. A decade of database-related solutions.
When you hit ten years, you start to become scared of ‘lots of disparate data sources,’ and learn to avoid these workloads.
When your meeting is going great… Get ready for that meeting to slow down.
Someone is around the corner to suggest another table of data…
The data source will put the brakes on your progress. A new data source that has never been cleaned, prepped, approved… Major ‘nope’ in my line of work…
You were asked, “What data source do you want to use?”
Feel free to play along, mentally put yourself in a group setting, and give yourself a few minutes to pick multiple data sources to present in front of a class of 40 professional peers.
You will be building decisions out of these data sources. Graphs, pie charts, forecasting, insights, etc.
Lastly, work as a team to generate a data solution surrounding these data sources. While you’re at it, don’t get too involved in figuring out what answers you will gain from your data source.
We were tasked to generate a project to kick-start our success stories. In this process, it seemed really challenging to decide on what to choose.
It was interesting because even though we were all eager to get started… We struggled to find the right granularity.
You have access to document-making tools, aka superpowers!
Data lakes, big data, data warehouses, … Each row of data offers a granularity.
Are we looking for an address? Zip code? State? Country?
Stating different parent and child relationships is all well and good. But until these words hit a piece of paper, Google Slides, or a whiteboard… We are left remembering what was said and making decisions based on our memory.
A zip code would be a more granular view of your data than State…
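Here is a small sketch of what granularity looks like in practice: the same rows rolled up to zip code, state, or country. The field names and the counts are made up for illustration.

```python
from collections import defaultdict


def roll_up(records, level, measure):
    """Sum a measure at the chosen level of granularity (e.g. 'zip' or 'state')."""
    totals = defaultdict(int)
    for record in records:
        totals[record[level]] += record[measure]
    return dict(totals)


# Hypothetical zip-level rows; the counts are invented, NOT real data.
records = [
    {"zip": "73301", "state": "TX", "country": "US", "people": 10},
    {"zip": "75001", "state": "TX", "country": "US", "people": 20},
    {"zip": "90001", "state": "CA", "country": "US", "people": 5},
]

by_zip = roll_up(records, "zip", "people")          # most granular
by_state = roll_up(records, "state", "people")      # coarser
by_country = roll_up(records, "country", "people")  # coarsest

print(by_zip)
print(by_state)
print(by_country)
```

You can always roll zip codes up to states, but you cannot split a state total back into zip codes, so it helps to agree on the finest level you need before collecting data.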
But without documenting our decisions, how do we collectively come together to suggest a path forward?
Also, why am I talking about documents?
Well – documentation is really helpful and it’s tangible information you can point at and say, “LOOK”… and in return, you will likely receive feedback.
Finding multiple data sources in any single industry can feel like a daunting task.
When you’re actively finding multiple public data sources – there are a lot of options.
Notice we just said “public data sources”…
Yes, as opposed to a private data source! If you’re new in the data industry, let me dive in.
- Public data source – safe to download, safe to share, available for everyone.
- Private data source – not safe to download, not safe to share, not available for everyone.
Here are a few up and coming data source providers. These are public data sets and available to everyone.
Remember – public data sources are like a Wiki.
Here’s my ‘buyer beware.’ Or maybe we should say ‘extractor beware.’
Anyone can add a data source, anyone can change the data source, and wikis are public, so anyone can make an edit. That’s why teachers recommend we avoid using them in scholarly reports: some random person likely added that information to the wiki.
How did our team pick a data source?
Originally, we struggled to find a data source because we were not documenting our ideas.
Our ideas were lost as soon as we said “anything” out loud.
We would share a great idea, agree it was a great idea, and then fail to generate next steps.
What made a difference was documenting our brainstorming session on the dry erase board.
Having everything on the whiteboard allowed everyone to see what was being said.
Not sure how to get started whiteboarding?
Whiteboarding is usually as simple as making a word cloud and circling good items and drawing a line through bad items.
- Reblogged – https://medium.com/@tylerkeamogarrett/finding-correlations-in-public-data-sources-7cb89bcc7067
Learn more about advanced analytics on my other blogs.
For example, on my LinkedIn, I write about spatial filtering of Google Analytics data.