There is no modern career that gets more buzz than data science. And why shouldn’t it? The data revolution is well underway. Companies like Amazon and Facebook have become corporate giants by riding and guiding this revolution. Every company wants to hire a data scientist to avoid missing the next big thing. This has created such a huge demand that the title “data scientist” now offers a pay premium that is hard to ignore.
But it’s not a field that is right for everyone. Then again, as I would hope we are all aware, no career is right for everyone.
So let’s break it down and try to answer these two questions: Who can become a data scientist? Who should become a data scientist?
What Is a Data Scientist?
Not the question I said I was going to answer, but it’s hard to go too far without this. Both the title of data scientist and in fact the field of data science are ill-defined.
What do you think of when you hear data science? For me, it was images of machine learning algorithms, artificial intelligence, neural networks, and huge amounts of data. I went into my first job expecting to be handed a giant dataset, asked to make a prediction using whatever complicated algorithm I could find, and to hand the algorithm off to a developer to productionize. Well, that was wrong. Instead, I found myself spending most of my time trying to understand business requirements, trying to figure out what data our company had that was useful and what it was useful for, and then building and productionizing software. So in a sense, I was:
- A product owner
- A data analyst
- A software engineer
Then at my second data science job, I found myself having to design not just the software, but the systems themselves and their connections to the rest of the company’s systems. Thus, on top of the above, I was also:
- A cloud architect
- A data modeler/architect
- A system administrator
- A data engineer
- A reporting analyst
I can’t believe I got this far without talking about data cleansing. Even once the systems and pipelines are set up and the data is all in place, you will still spend 90% of your remaining time cleaning data. Think this is hyperbole? Ask around. You’ll find that 90% is probably better described as a lower bound.
There are places where the job is kind of like what I imagined. (Those jobs should be titled Machine Learning Engineer, not Data Scientist.) I interviewed for a few of them, and a couple of my friends still work at such places. I am sooo glad that I didn’t get those jobs because I would have gotten bored very quickly.
Most companies don’t know what a data scientist does, whether they need one, whether they are ready for one, and what skills they should be looking for when they hire one. They know they have data and they have become convinced that they need to garner additional insights from it. In most cases, they should probably be hiring an architect or an engineer, but the data science title is the one that they are familiar with.
So what is a data scientist? It’s whatever the company decides they are hiring. Look at the job requirements. Ask the company what their current data infrastructure is. Talk to anyone that already works with the data there if you can. Don’t assume you know what the job entails from the title.
The 3 Areas of Knowledge Every Data Scientist Needs
Not to go all No True Scotsman on you, but any real data science job requires these three areas of knowledge: statistics, programming, and domain knowledge.
Data science is all about understanding data, and statistics is the science of collecting and analyzing data, so this one shouldn’t be surprising. But I’ve seen plenty of candidates who tried to get a data science position with no real statistics knowledge. You don’t have to be able to do complex statistical derivations, but you should have a STEM-degree level of stats knowledge. And you’d better understand experimental design. In fact, you’d better live experimental design.
Want to know if you are ready? Try answering the following questions.
- What is the importance of the law of large numbers?
- What is the difference between independent variables and uncorrelated variables?
- What is the probability of pulling both a 3 and a 4 from a deck of cards in that order? What if order didn’t matter?
Not a complete set, to be sure, but if any of those gave you pause, it might be time to crack open a stats book. An Introduction to Statistical Learning is a fantastic option and is available as a free PDF.
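To show the kind of reasoning the card question is after, here is a minimal sketch in Python. The question leaves a couple of things open, so I am assuming a standard 52-card deck and two cards drawn one after the other without replacement. The function names are mine, purely for illustration, and the brute-force check is only there to confirm the arithmetic.

```python
from fractions import Fraction
from itertools import permutations

# Assumption (not stated in the question): a standard 52-card deck,
# and two cards drawn one after the other without replacement.
RANKS = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
SUITS = ["clubs", "diamonds", "hearts", "spades"]
DECK = [(rank, suit) for rank in RANKS for suit in SUITS]


def prob_three_then_four() -> Fraction:
    """P(first card is a 3 and second card is a 4)."""
    # 4 threes out of 52 cards, then 4 fours out of the remaining 51.
    return Fraction(4, 52) * Fraction(4, 51)


def prob_three_and_four_any_order() -> Fraction:
    """P(one card is a 3 and the other is a 4, in either order)."""
    # "3 then 4" and "4 then 3" are disjoint and equally likely.
    return 2 * prob_three_then_four()


def brute_force(ordered: bool) -> Fraction:
    """Check the formulas by enumerating all 52 * 51 ordered two-card draws."""
    draws = list(permutations(DECK, 2))
    if ordered:
        hits = sum(1 for first, second in draws
                   if first[0] == "3" and second[0] == "4")
    else:
        hits = sum(1 for first, second in draws
                   if {first[0], second[0]} == {"3", "4"})
    return Fraction(hits, len(draws))


if __name__ == "__main__":
    print(prob_three_then_four())           # 4/663
    print(prob_three_and_four_any_order())  # 8/663
    print(brute_force(ordered=True) == prob_three_then_four())
    print(brute_force(ordered=False) == prob_three_and_four_any_order())
```

Under those assumptions, the ordered case works out to 4/663 (about 0.6%), and dropping the order requirement doubles it to 8/663. If you read the question as drawing with replacement, the second factor becomes 4/52 instead of 4/51, which is exactly the kind of assumption worth stating out loud in an interview.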
Alternative: if statistics is the deal-breaker for you, consider data engineering instead of data science. Those jobs tend to require significantly less stats knowledge, but you will not be doing any data modeling since that is very heavily driven by statistics. You may also consider more of a software development role.
On the programming side, you won’t always be expected to sit down and code a production-ready software package, but you will be expected to know how to write scripts. Pick up Python (preferred) or R and you’ll be fine. Become familiar with Unix and Git if you aren’t already, as you’ll almost certainly be using those.
Alternative: Don’t want to learn to program? Data analytics may be for you. All the data fun, but you’ll be working in software packages that somebody else built (let’s be honest, it’ll be Excel).
Oh yeah, a data scientist definitely needs to know the domain they are working in. I don’t often quote Donald Rumsfeld, but when I do, it’s always about unknown unknowns.
“…there are known knowns; there are things we know we know. We also know there are known unknowns; that is to say we know there are some things we do not know. But there are also unknown unknowns — the ones we don’t know we don’t know.” — Donald Rumsfeld
When you don’t have domain knowledge, you can still do a good job of dealing with known unknowns, but unknown unknowns are going to get you. Even with domain knowledge, they are still a problem, but domain knowledge gives you an edge in identifying and understanding them.
When I worked in vacation rentals, there was a company that a lot of property managers used to handle their dynamic pricing. I did not know this. I won’t go into too much detail, but I spent far too much time chasing what I thought were some exciting results. I thought I had invented the wheel! But really someone else had invented the wheel, and I was just the guy that “discovered” what most everyone in the industry already knew existed. That’s why domain knowledge matters to a data scientist.
Alternative: Do you remember when I mentioned Machine Learning Engineers before? That’s the title you probably want if you don’t want to get too involved in domain knowledge.
Don’t Forget the Non-Technical Skills
A good data scientist is a problem-solver, a story-teller, and an excellent communicator, both verbally and in writing. You are often going to be involved in the entire process, from problem identification to solution implementation and communication of results.
They are curious, creative, and persistent. They play well with both technical and non-technical co-workers. They think like scientists but also like business people. Sound like your cup of tea?
Can I Become a Data Scientist?
If you have the statistics and programming know-how, have or are willing to gain the domain knowledge, can solve problems like the best of them, and can communicate even with the worst of them, then just maybe you could be a data scientist. If you have a STEM degree, that will be a plus. If that degree is a master’s or doctorate, even better.
And even if you don’t have the right degree, there are plenty of places that will want you as a data scientist if you show that you have the talent and experience. Those places just may not be the big banks or insurance companies that can afford to hire anyone using any arbitrary standard they want. Instead, aim for a startup or a small business if you can’t get hired with the big guys.