Professor Nathan Taback shares the origin story of Data Science. And Professor Carolina Nobre provides insight into her research in Data Visualization. From Nathan and Carolina, we learn more about the components of data science, the importance of data visualization, the Data Science Specialist program at the University of Toronto, and what research and industry opportunities look like in the field of data science.
[1:40] How did the field of data science come to be? Professor Nathan Taback shares its origin story.
[6:50] How are the fields of data science and computer science related? Professor Carolina Nobre, leader of the Human Interaction Visualization Lab, breaks down the components of data science.
[10:40] Studying data science at the University of Toronto. Learn more about the Data Science Specialist Program and the Data Science Institute. Carolina shares an example of a 4th year undergraduate research project in data visualization.
[13:30] Where can you go with a data science degree? We discuss opportunities in industry and academia.
[17:35] Diane and Mario summarise the key take-aways of the episode.
Mario: As you explore the courses and programs within computer science, you might have noticed Data Science among the course offerings and focuses. But what is it exactly? And how is it different from the other subfields of computer science?
Diane: In this episode, we’ll develop a deeper understanding of what data science is and its relationship to computer science by speaking with some professors who are data science experts.
Mario: Diane, what is data science to you?
Diane: Hm, that’s a good question. What is it to you?
Mario: Well, obviously, data science is an interdisciplinary academic field that uses statisti-
Diane: Wait. Are you just reading from Wikipedia?
Mario: Maybe.
Diane: (sighs) Mario, if we want this podcast to take off, we need the insight of experts.
Mario: Okay… how about the Special Advisor to the Dean of Arts and Science on Computational and Data Science Education?
Nathan: So, the term data science originated with Jeff Wu, who's a statistician at Georgia Tech now. And in 1985, he gave a lecture to the Chinese Academy of Sciences in Beijing, where he used the term data science for the first time as an alternative to statistics.
Diane: That’s Professor Nathan Taback, who gets this episode started by telling us about the origins of data science as a field.
Mario: I don’t know about you, Diane, but I don’t remember hearing anything about data science in the 90s or even early 2000s.
Diane: Yes, it’s definitely being used more prominently today, though. Nathan shared the story of a mystery that contributed to data science becoming very popular.
Nathan: It really became popular in the culture in the early 2000s. And that's when Google was a startup. In fact, in those times, Google was still trying to figure out how to be profitable, how do you make search profitable, and there was enormous pressure from their backers, who were largely venture capitalists at the beginning, to monetize the search engine.
Mario: It’s probably hard for many listeners to think of Google as a startup.
Diane: Or to imagine a time where search wasn’t monetized!
Nathan: So, the team at Google that looked at the data logs, they arrived one morning in the office to find that a peculiar phrase had risen to the top of search queries. And that was the search query: “What is Carol Brady's maiden name?”
Mario: Wait, wait, wait, hold up. I have a feeling some of our listeners may be a little lost as to who Carol Brady is.
Diane: Well, let’s not get ahead of ourselves. For now, you just need to know that Carol Brady was a character on a TV show from the 70s called The Brady Bunch.
Mario: Alright. Let’s resume the mystery that Nathan was unfolding.
Nathan: Ameet Patel, who recounted this in the New York Times article, he said that, you know, you can't interpret the logs unless you know what else is going on in the world. And what they noticed is that this pattern of search queries had happened in spikes, each beginning 45 minutes after the hour.
Mario: So, this was the same search for Carol Brady's maiden name separated by 45-minute intervals.
Nathan: That's right. That's right. So, they're thinking, what's going on? Why these 45-minute spikes, like, you know, spikes after 45 minutes. So, what they learned, you know, after a lot of talking, I think, in the offices, is that there used to be a show called “Who Wants to Be a Millionaire”.
Mario: Our next history lesson! Not quite as far back as the Brady Bunch, though.
Diane: True! This was a game show where contestants were asked a series of multiple-choice questions. They lost as soon as they got one wrong, but if they answered every question correctly, they won a million dollars!
Mario: I’m guessing the question had to do with Carol Brady.
Nathan: Yeah, yeah, yeah. The question had to do with Carol Brady. The contestant was asked, “What is Carol Brady’s maiden name?”
Mario: And so everybody, like, flocked to Google at the time.
Nathan: That’s right.
Diane: So there was a question about Carol Brady 45 minutes into the show. But that only explains a part of the mystery. Our listeners may be wondering: Why was Google seeing the same pattern every hour?
Mario: Well, there’s another piece of history you need to know here. This was the era before streaming services, so your cable television subscription played “Who Wants to be a Millionaire” at a specific time. Everyone watched the show at the same time.
Diane: And remember that we have time zones! So, the show aired at 7:00 PM Eastern Time, but it wouldn’t air in the next time zone for another hour. And so on.
Nathan: Right, so what they figured out is that they could actually figure out what people were doing when they were searching. That searches correlated with human behavior.
Diane: Let that sink in for a minute. An algorithm could infer, based on your search terms, whether you were watching “Who Wants to be a Millionaire?” That must have felt like an incredible realization.
Mario: Uh… incredibly spooky.
Diane: Yeah, I think we've all had those experiences, where things from one area of our online life show up in the advertisements we see in another area.
Nathan: Absolutely. Yeah. Yeah.
Diane: Someone’s paying attention.
Nathan: Someone's paying attention. Exactly. And I don't even think they knew that they were really, actually, paying attention, that they, being Google, could actually use this information, that it actually represented something in the real world.
Diane: But this is a key moment in history where this discovery was made, and they turned it into something that made money for the company.
Nathan: That's right. That's right because they suddenly figured out how to sell advertising based on people's search queries.
Diane: So, they realized that searches correlated with human behaviour.
Mario: Data science is all about using data, like search queries, to infer things about the world, in this case what people are watching on TV. And if you think about all the different kinds of data that we have out there, and all the things we might want to infer, you start to get a grasp of how big this field really is.
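The kind of inference Mario describes can be sketched in a few lines of Python. This is only a rough illustration, not how Google analysed its logs: the log entries, the dates, and the function names `minute_histogram` and `peak_minute` are all invented for this example. The idea is simply to count how often a query shows up at each minute past the hour and see whether one minute stands out.

```python
from collections import Counter
from datetime import datetime

QUERY = "what is carol brady's maiden name"

# Hypothetical query log: (timestamp, query) pairs, invented for illustration.
LOG = [
    ("2001-01-01 19:45", QUERY),
    ("2001-01-01 19:46", QUERY),
    ("2001-01-01 20:45", QUERY),
    ("2001-01-01 20:46", QUERY),
    ("2001-01-01 21:45", QUERY),
    ("2001-01-01 22:45", QUERY),
    ("2001-01-01 10:12", QUERY),
    ("2001-01-01 19:45", "weather toronto"),
]


def minute_histogram(log, query):
    """Count occurrences of `query` at each minute past the hour."""
    counts = Counter()
    for stamp, text in log:
        if text == query:
            minute = datetime.strptime(stamp, "%Y-%m-%d %H:%M").minute
            counts[minute] += 1
    return counts


def peak_minute(log, query):
    """Return the minute past the hour with the most hits, or None."""
    hist = minute_histogram(log, query)
    if not hist:
        return None
    return hist.most_common(1)[0][0]


if __name__ == "__main__":
    print(minute_histogram(LOG, QUERY))
    print(peak_minute(LOG, QUERY))
```

A spike at the same minute across several consecutive hours is the sort of pattern that, in the story, only made sense once someone connected it to a TV show airing in successive time zones.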
Diane: So far, we’ve established that data science is about inferring things about the real world from data. But what are the tasks associated with doing data science?
Carolina: So, there are several different components within data science, and you could broadly say it is the process of capturing data, maintaining data, processing data, analyzing data and communicating data. So, of those five, data visualization really fits into the analyzing and communicating data.
Mario: That’s Dr. Carolina Nobre, a professor in the Department of Computer Science. Her focus is on data visualization and she leads HIVE, the Human Interaction Visualization Lab.
Diane: Let’s reiterate those components: capturing data, maintaining data, processing data, analyzing data and communicating data. And when you consider how large a dataset can be, a lot of computer science goes into even just storing the data. For instance, it may be distributed across a network of computers.
Mario: That’s definitely touching on some key computer science concepts! I’m hearing databases, networks, distributed systems. Is that data science?
Carolina: …the way I would very generally summarize it is that computer science is the methods, it’s how we're actually going to use the computer to perform the tasks that, as humans, we're interested in performing. Databases, machine learning, modeling, all these things. Data science says, Okay, how can we use all these tools and techniques that computers give us to understand this world of data that we're living in?
Diane: And of course there is the statistics side to data science. Nathan talked about how computer science and statistics both contribute to data science.
Nathan: Statisticians were trained in the analysis of information, but not necessarily the extraction or the management of information. Computer scientists were trained in the, you know, maintenance and storage of information. But really, you needed those two to come together.
Mario: Machine learning is something that straddles both computer science and statistics. So is machine learning computer science? Is it data science?
Carolina: That's a great example of how it can be both depending on what the focus is, right? If you were developing a new machine learning technique, and what you're really interested in is the algorithm and how it works, it's really computer science: you are developing the method, and that is the computer science field.
If you are using different machine learning methods to process data and get insights and really focus on how different machine learning models potentially give you different types of insights on your data, the focus has shifted, your contribution is now understanding the data, and that's more of a data science field.
Diane: In our last episode, we learned that machine learning is very good at identifying patterns that humans might not see. But the opposite is also true.
Carolina: The human perceptual system is incredibly powerful, and data visualization allows us to take the best of both worlds: the human perceptual system and the power of computers. Right, you can run all of the algorithms in the world, and yet you can still show the results to a human, and they will see patterns that potentially your algorithm didn't catch.
They will see something that, based on their expert knowledge of that data, looks a little off, that the algorithm couldn't possibly have caught because the algorithm didn't know what you were expecting to see from that data. So, because of this, the power of data visualization and bringing those two together, it's become really prominent in the field, or in the subcomponent of data science, of analyzing data and then communicating that data to our viewers.
Diane: I hope this discussion has helped clarify what data science is and how it relates to computer science. Let’s talk next about how you can bring data science into your undergraduate degree.
Mario: Diane, a lot of students ask me what the difference is between a Data Science Specialist and a double major in Computer Science and Statistics.
Diane: This sounds like a question for Nathan.
Nathan: …the Data Science Program, which I've been involved with, the large part of it is really a major in computer science and a major in statistics. But with the added benefit of these integrative courses where students learn how what I'll call the back end of data science, so the storage, acquisition, and maintenance of data, and the analysis of data come together.
Mario: Nathan is referring to the JSC courses (JSC270, 370, and 470) that are exclusive to students enrolled in the Data Science Specialist program. Instead of just taking separate CS and stats courses, in these integrative courses, the CS and statistics are brought together.
Diane: But what if you aren’t in a data science program? Are there other ways to bring CS and Stats together?
Mario: One option is co-curricular activities, such as those provided by the Data Science Institute. For example, there is SUDS, the Summer Undergraduate Data Science Opportunities program, where you can work with a faculty member to apply data science methods in a range of domains.
Diane: Another way to pursue a data science project is through course CSC494H1. It lets you work closely with a faculty member on a project for course credit. Carolina shared one example of a CSC494 project that she supervised.
Carolina: And so, for example, my undergraduate student currently in this, this class is developing a new way of drawing pie charts. Everyone knows the pie chart, some people love it, visualization designers hate it, let's make it better.
Mario: Carolina, what kind of impact could this project have?
Carolina: The scope of this project is to redesign the pie chart, implement it, and if we have time, we're going to run a little user study that says, hey, is the new pie chart better than the old pie chart, and for which tasks. So essentially, the idea is making incremental progress towards something that can be considered a contribution to that field. Now, this is not necessarily something that is going to revolutionize the field, but it is improving something in some way that you can do over the scope of six months.
Diane: If you’re still wondering whether to apply to the Data Science Specialist program, Nathan sums it up well.
Nathan: I would encourage students to enroll in the data science specialist if they really do love both computing and statistics and they're really interested in data and how it's used in society and in business.
Mario: Nathan, where can you go with a data science degree?
Nathan: So really, it's anything you can think of, so, you know, finance, hospitals, research, going to graduate school.
Diane: Earlier today, I used a popular job board to look up data science jobs in Toronto, and the results were just as Nathan suggested: there were job postings in diverse industries including banking and finance, the game industry, food and beverage, huge tech companies, healthcare, pharmaceuticals, and retail.
Mario: And there is an option that marries graduate school with an industry focus. Our Master of Science in Applied Computing program, which offers a data science concentration, includes an eight-month paid internship where students bring the latest research they’ve been learning in their graduate courses to a company that wants to apply them.
Diane: It’s a very good program. And I’m getting a good picture of what it is like to study or work in applied data science. But I wonder what research is like in data science.
Mario: Carolina, can you share a little about your research?
Carolina: So, my research is in the field of data visualization. More specifically, I study how people learn and extract meaning from the visualizations that they see.
And so I also, in a related field, study how we can create intelligent user visualizations, so that these can adapt to the users that have different levels of cognitive traits, visual literacy, or just even different analysis goals when they're using these visualizations.
So, when we think of data science very broadly… data science is just research that relates to how do we capture and how do we make sense of data? And the field of data visualization has this focus on how can we use visualizations to make sense of data, capturing insights and communicating those insights.
Mario: Earlier in this episode, Carolina talked about re-designing the pie chart. That research was not limited to an implementation, but also included an evaluation of the re-design through a user study. We asked Carolina to explain how user studies fit into the field of computer science.
Carolina: When you explicitly bring the human into your research, that's where we get into the field of HCI. Right? HCI is the field, How do humans interact with computers? HCI is very broad, and you can say that data science is a part of HCI. Computer science is a part of HCI, but it really is that formalization of, I now want to research how all of these things that I've studied in my little lab and me and the computer, how does that really apply to real world users and humans?
Diane: A user study sounds fancy, but Carolina told us how an undergraduate student was able to set one up to evaluate a pie chart re-design using Qualtrics survey software.
Carolina: In the field of visualization, there is often and I guess, in a lot of fields in computer science as well, when you make a contribution, you need to evaluate: was this contribution as good as I imagined, was it as effective? Was it useful? Was it usable?
In that context, we're running a user study where we give some people the old pie chart, we give some people the new pie chart, and we ask them to perform tasks. Find us the largest wedge, answer the question x y z. So, this is a very trivial user study that you set up in an interface such as Qualtrics, or one of these, where you simply give the two conditions, ask the questions and run it. So, it's very easy to implement.
And what I like about ending a project with an evaluation is that users, or in this case, our student has a real world experience of understanding, how do you evaluate the validity of your research contribution beyond your faculty member saying, Hey, this looks good, right?
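To make the two-condition design Carolina describes a little more concrete, here is a minimal Python sketch of how responses from such a study might be summarised. It assumes each response records which chart the participant saw, which task they attempted, and whether they answered correctly. The data below is invented purely for illustration and does not come from the actual project, and `accuracy_by_condition` is a hypothetical helper name.

```python
from collections import defaultdict

# Hypothetical survey export: (condition, task, answered_correctly).
# These rows are made up for illustration; they are not study results.
RESPONSES = [
    ("old", "largest wedge", True),
    ("old", "largest wedge", False),
    ("old", "compare two wedges", True),
    ("old", "compare two wedges", False),
    ("new", "largest wedge", True),
    ("new", "largest wedge", True),
    ("new", "compare two wedges", True),
    ("new", "compare two wedges", False),
]


def accuracy_by_condition(responses):
    """Return the fraction of correct answers for each condition."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for condition, _task, is_correct in responses:
        totals[condition] += 1
        correct[condition] += int(is_correct)
    return {cond: correct[cond] / totals[cond] for cond in totals}


if __name__ == "__main__":
    for condition, accuracy in accuracy_by_condition(RESPONSES).items():
        print(f"{condition} pie chart: {accuracy:.0%} correct")
```

A real evaluation would also break results down by task and check whether any difference is larger than chance would explain, but the basic shape is the same: two conditions, the same questions, and a comparison of outcomes.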
Mario: Carolina has given us a glimpse at research in data visualization. But remember that data visualization is just one part of data science. There are researchers working on all the aspects of data science: capturing, maintaining, processing, and analyzing data.
Diane: So, Mario, we’ve learned about what data science is and how it relates to computer science, options for pursuing it through an undergrad program or project work, and options to work in data science in industry or pursue it through research. What are some key take-aways for you?
Mario: I loved how there were so many different aspects to data science – how you could work on under-the-hood, systems level support for data and algorithms, or work on the algorithms themselves, or visualization of results to communicate effectively. And also the fact that data science is being used in absolutely everything!
Diane: So true. And one thing we haven’t focused on is the value of bringing your other interests to bear. For instance, if you have a passion for environmental issues and perhaps some background in that area, other people will find that very valuable. It can play a key role in job or research opportunities.
Mario: Absolutely. One of the things I love about all of the programs we offer in computer science at the University of Toronto is that you can complement them with courses or programs in another domain that you’re interested in.