1 00:00:00,500 --> 00:00:01,984 [SQUEAKING] 2 00:00:01,984 --> 00:00:03,968 [RUSTLING] 3 00:00:03,968 --> 00:00:07,440 [CLICKING] 4 00:00:16,337 --> 00:00:18,920 VIJAY GADEPALLY: I hope you all are enjoying the class so far. 5 00:00:18,920 --> 00:00:20,730 It's always interesting-- we have 6 00:00:20,730 --> 00:00:22,440 people with a variety of backgrounds 7 00:00:22,440 --> 00:00:26,130 here, so it's just great to kind of hear the questions as we're 8 00:00:26,130 --> 00:00:27,460 going along. 9 00:00:27,460 --> 00:00:30,870 So I'm going to give a quick example 10 00:00:30,870 --> 00:00:33,180 to begin the class right now that kind of drives 11 00:00:33,180 --> 00:00:35,338 home some of the concepts that we talked about 12 00:00:35,338 --> 00:00:36,130 in the first class. 13 00:00:36,130 --> 00:00:37,770 But this is kind of real rather than 14 00:00:37,770 --> 00:00:39,860 sort of cartoon neural networks. 15 00:00:39,860 --> 00:00:43,950 It's sort of a real research result that we have. 16 00:00:43,950 --> 00:00:45,627 Before I begin, I'd like to thank 17 00:00:45,627 --> 00:00:47,460 Emily Do, who was one of my graduate students 18 00:00:47,460 --> 00:00:48,770 and actually did all the work. 19 00:00:48,770 --> 00:00:50,940 I just put some of the slides together-- not even 20 00:00:50,940 --> 00:00:52,500 all of them, just a few. 21 00:00:52,500 --> 00:00:55,920 So really, the credit for this work goes to Emily. 22 00:00:55,920 --> 00:00:59,580 Anything I misrepresent is my fault, not hers. 23 00:00:59,580 --> 00:01:03,330 She graduated, so I'll take the blame for anything 24 00:01:03,330 --> 00:01:04,330 that's not interesting. 25 00:01:04,330 --> 00:01:07,680 So with that, we'll begin. 26 00:01:07,680 --> 00:01:09,780 All right, so the overall goal of this project 27 00:01:09,780 --> 00:01:12,180 was really to detect and classify network attacks 28 00:01:12,180 --> 00:01:13,990 from real internet traffic. 
29 00:01:13,990 --> 00:01:16,950 And in order to do this, as many of you can imagine, 30 00:01:16,950 --> 00:01:19,000 we had to find a data set that was of interest. 31 00:01:19,000 --> 00:01:20,370 So this is probably a problem many of you 32 00:01:20,370 --> 00:01:21,840 are currently thinking about, which 33 00:01:21,840 --> 00:01:24,810 is what data set should I get my hands on? 34 00:01:24,810 --> 00:01:26,700 Right, so we have a variety of data sets. 35 00:01:26,700 --> 00:01:28,380 Some are sensitive in nature, right? 36 00:01:28,380 --> 00:01:31,140 So think of internal network traffic 37 00:01:31,140 --> 00:01:32,520 that we're trying to collect. 38 00:01:32,520 --> 00:01:36,600 No one's going to let us hand this over to graduate students 39 00:01:36,600 --> 00:01:39,565 to kind of work on, and then more importantly, publish. 40 00:01:39,565 --> 00:01:41,190 So the first thing that we wanted to do 41 00:01:41,190 --> 00:01:45,060 was look for a data set that was kind of open and out there. 42 00:01:45,060 --> 00:01:48,480 And we're fortunate that there is a group in Japan called 43 00:01:48,480 --> 00:01:51,420 the MAWI working group, which stands for the measurement-- 44 00:01:51,420 --> 00:01:53,270 and I'll use my cool tool-- 45 00:01:53,270 --> 00:01:55,500 the measurement and analysis of wide area 46 00:01:55,500 --> 00:01:57,450 internet traffic working group, that 47 00:01:57,450 --> 00:02:00,000 actually has a 10 gig link that they've 48 00:02:00,000 --> 00:02:01,868 tapped using a network tap. 49 00:02:01,868 --> 00:02:03,660 And they actually make this data available. 50 00:02:03,660 --> 00:02:07,060 It's actually continuously updated even to today. 51 00:02:07,060 --> 00:02:10,919 So that's really cool, and this is within that. 52 00:02:10,919 --> 00:02:14,460 It's called the Day-in-the-Life of the Internet. 53 00:02:14,460 --> 00:02:17,670 And so this has been going on for multiple years. 
54 00:02:17,670 --> 00:02:20,190 The data set is reasonably large when you convert it 55 00:02:20,190 --> 00:02:22,740 into what we call an analyst friendly form, which 56 00:02:22,740 --> 00:02:26,130 is something that you or I could read and make some sense out 57 00:02:26,130 --> 00:02:26,630 of. 58 00:02:26,630 --> 00:02:28,310 It's about 20 terabytes in size. 59 00:02:28,310 --> 00:02:30,190 So a reasonably large data set, not 60 00:02:30,190 --> 00:02:33,450 going to fit on a single computer or a single node, 61 00:02:33,450 --> 00:02:36,960 so certainly a good use case for using high performance 62 00:02:36,960 --> 00:02:38,940 computing or supercomputing. 63 00:02:38,940 --> 00:02:41,597 And one additional piece that we should note-- 64 00:02:41,597 --> 00:02:43,680 and this is something that, as you're starting off 65 00:02:43,680 --> 00:02:45,750 your projects, you may also think about-- 66 00:02:45,750 --> 00:02:49,740 IP addresses are often seen as reasonably sensitive. 67 00:02:49,740 --> 00:02:52,450 So you can see where traffic is coming and going from. 68 00:02:52,450 --> 00:02:54,210 So in this particular data set, they've 69 00:02:54,210 --> 00:02:57,600 actually deterministically anonymized 70 00:02:57,600 --> 00:02:59,490 each of the IP addresses. 71 00:02:59,490 --> 00:03:01,380 And as you're downloading the data, 72 00:03:01,380 --> 00:03:04,080 you have to, essentially, sign a user agreement saying 73 00:03:04,080 --> 00:03:07,710 that you will not attempt to kind of go back and figure out 74 00:03:07,710 --> 00:03:09,660 what the original IP addresses were. 
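[EDITOR'S NOTE: The deterministic anonymization just described can be sketched in a few lines. This is a toy illustration with a made-up function name and key, not the scheme the MAWI group actually uses: a keyed hash maps each address to a stable pseudonym, so the same host always maps to the same fake address and traffic patterns stay analyzable, but without the key you cannot walk back to the original.]

```python
import hmac
import hashlib

def anonymize_ip(ip: str, key: bytes) -> str:
    """Map an IPv4 address to a stable pseudonym in 10.x.y.z space.

    Deterministic: the same input always yields the same output, so
    flows remain linkable after anonymization, but recovering the
    original address requires the secret key.
    """
    digest = hmac.new(key, ip.encode(), hashlib.sha256).digest()
    return "10." + ".".join(str(b) for b in digest[:3])

key = b"site-secret"                      # hypothetical per-dataset secret
a = anonymize_ip("192.168.1.10", key)
b = anonymize_ip("192.168.1.10", key)
print(a == b)   # -> True: same host, same pseudonym every time
```

A real release would use a stronger, prefix-preserving scheme so subnet structure survives, but the determinism property is the same.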
75 00:03:09,660 --> 00:03:11,490 So as you're coming up with the data sets-- 76 00:03:11,490 --> 00:03:14,230 and I think Jeremy is going to talk a lot more about this 77 00:03:14,230 --> 00:03:15,780 in the next part-- 78 00:03:15,780 --> 00:03:18,240 you might think about, you know, are there certain fields 79 00:03:18,240 --> 00:03:21,360 in the data that I'm collecting that might be deemed sensitive, 80 00:03:21,360 --> 00:03:23,170 either today or in the future? 81 00:03:23,170 --> 00:03:25,080 And if so, are there just simple techniques 82 00:03:25,080 --> 00:03:28,080 that I could use to anonymize the data? 83 00:03:28,080 --> 00:03:31,170 And then with a decent end user agreement, 84 00:03:31,170 --> 00:03:34,650 you might be able to enforce some level of assurance that people 85 00:03:34,650 --> 00:03:37,540 won't try to break it and go back. 86 00:03:37,540 --> 00:03:40,700 It's not impossible to do it, but any legitimate researcher 87 00:03:40,700 --> 00:03:42,570 who is using data really shouldn't 88 00:03:42,570 --> 00:03:44,580 care about what the original IP addresses 89 00:03:44,580 --> 00:03:47,880 in this particular case are or maybe other such data 90 00:03:47,880 --> 00:03:50,460 within whatever you're collecting. 91 00:03:50,460 --> 00:03:52,205 All right, so just a quick definition 92 00:03:52,205 --> 00:03:53,080 of anomaly detection. 93 00:03:53,080 --> 00:03:56,880 I really love this definition of what an outlier is 94 00:03:56,880 --> 00:03:58,835 from a paper by Hawkins. 95 00:03:58,835 --> 00:04:00,210 So, "An outlier is an observation 96 00:04:00,210 --> 00:04:02,580 that deviates so much from other observations 97 00:04:02,580 --> 00:04:05,160 as to arouse suspicion that it was generated 98 00:04:05,160 --> 00:04:06,970 by a different mechanism." 99 00:04:06,970 --> 00:04:08,790 So when we're looking at network traffic, 100 00:04:08,790 --> 00:04:10,040 that's what we're looking for. 101 00:04:10,040 --> 00:04:10,540 Right? 
102 00:04:10,540 --> 00:04:12,480 We're looking for these mechanisms that 103 00:04:12,480 --> 00:04:15,180 are present in network traffic that 104 00:04:15,180 --> 00:04:17,910 deviate so much from normal behavior 105 00:04:17,910 --> 00:04:20,579 that it arouses suspicion. 106 00:04:20,579 --> 00:04:23,800 And it could be for a variety of reasons. 107 00:04:23,800 --> 00:04:27,060 It could be somebody uploaded this cool video of somebody 108 00:04:27,060 --> 00:04:29,580 falling flat on their face, or it 109 00:04:29,580 --> 00:04:33,150 could be a botnet, or DDoS attack, or something like that. 110 00:04:33,150 --> 00:04:35,250 So an outlier can sometimes be referred 111 00:04:35,250 --> 00:04:38,710 to as an anomaly, surprise, or exception. 112 00:04:38,710 --> 00:04:40,800 And within the context of cyber networks, 113 00:04:40,800 --> 00:04:43,920 these mechanisms can be botnets, C&C-- or command 114 00:04:43,920 --> 00:04:47,970 and control-- servers, insider threats, or any other network 115 00:04:47,970 --> 00:04:51,030 attacks, such as distributed denial of service or port scan 116 00:04:51,030 --> 00:04:53,010 attacks. 117 00:04:53,010 --> 00:04:54,720 There are a number of general techniques 118 00:04:54,720 --> 00:04:56,670 to deal with outlier detection. 119 00:04:56,670 --> 00:04:58,800 So the first one is to look for changes-- 120 00:04:58,800 --> 00:05:00,570 is to essentially use statistics. 121 00:05:00,570 --> 00:05:03,780 You look at what's going on in your mechanism, 122 00:05:03,780 --> 00:05:07,230 and you look for statistical anomalies from that. 123 00:05:07,230 --> 00:05:09,330 And you'll highlight those anomalies, 124 00:05:09,330 --> 00:05:11,730 and then you'll kind of dive deep into that. 125 00:05:11,730 --> 00:05:12,780 You could do clustering. 126 00:05:12,780 --> 00:05:14,970 So if you're looking at unsupervised learning, 127 00:05:14,970 --> 00:05:16,600 you could cluster your data. 
128 00:05:16,600 --> 00:05:18,430 And things that are really far away 129 00:05:18,430 --> 00:05:21,220 from existing clusters, or known clusters, 130 00:05:21,220 --> 00:05:24,070 are probably something that you want to take a look at. 131 00:05:24,070 --> 00:05:24,900 You could lose-- 132 00:05:24,900 --> 00:05:27,200 Similar to that is also distance-based techniques. 133 00:05:27,200 --> 00:05:29,530 So these are kind of closely related, 134 00:05:29,530 --> 00:05:32,470 in that you look for observations that vary significantly 135 00:05:32,470 --> 00:05:36,290 from other observations in some sort of a feature space. 136 00:05:36,290 --> 00:05:38,560 And finally, you could use a model-based technique, 137 00:05:38,560 --> 00:05:42,760 where you attempt to model the behavior of normal-- 138 00:05:42,760 --> 00:05:44,950 right, and I use air quotes for the word normal-- 139 00:05:44,950 --> 00:05:47,380 and then, you come, and you look for things that deviate 140 00:05:47,380 --> 00:05:48,550 from this background model. 141 00:05:48,550 --> 00:05:50,290 So all of these four techniques are 142 00:05:50,290 --> 00:05:52,633 approaches to anomaly detection kind of related 143 00:05:52,633 --> 00:05:53,300 to each other. 144 00:05:53,300 --> 00:05:55,860 It's not a hard and fast rule about how 145 00:05:55,860 --> 00:05:57,550 they deviate from each other. 146 00:05:57,550 --> 00:06:01,460 But for the popularity and the complexity of network traffic, 147 00:06:01,460 --> 00:06:03,640 we decided to go with a model-based technique 148 00:06:03,640 --> 00:06:05,860 because we found that any other means of trying 149 00:06:05,860 --> 00:06:07,780 to represent the data would be difficult. 
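[EDITOR'S NOTE: As a concrete instance of the statistical approach mentioned above, here is a minimal sketch with made-up traffic numbers. It uses a median-and-MAD "modified z-score" rather than a plain mean and standard deviation, because a large outlier inflates the mean and standard deviation enough to hide itself.]

```python
from statistics import median

def mad_outliers(xs, threshold=3.5):
    """Flag points whose modified z-score (median/MAD based) exceeds
    `threshold`. Robust: the outlier itself barely moves the median
    or the MAD, so it cannot mask its own detection."""
    med = median(xs)
    mad = median(abs(x - med) for x in xs)
    # 0.6745 rescales MAD to match a standard deviation for normal data
    return [x for x in xs if mad and 0.6745 * abs(x - med) / mad > threshold]

traffic = [100, 102, 98, 101, 99, 103, 97, 100, 500]   # packets/sec, one spike
print(mad_outliers(traffic))   # -> [500]
```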
150 00:06:07,780 --> 00:06:10,030 But this is sort of a top down approach of how we kind 151 00:06:10,030 --> 00:06:12,130 of thought about, well, let's look at network traffic-- 152 00:06:12,130 --> 00:06:14,547 let's think about anomalies in network traffic-- and let's 153 00:06:14,547 --> 00:06:17,780 come up with what's a good approach to address that. 154 00:06:17,780 --> 00:06:19,540 So when we talk about network attacks-- 155 00:06:19,540 --> 00:06:21,780 not to go into detail-- 156 00:06:21,780 --> 00:06:25,620 there is a wide variety of network attacks out there. 157 00:06:25,620 --> 00:06:28,210 Every day we see news articles about new ones. 158 00:06:28,210 --> 00:06:31,540 The focus of this work was to look at just a couple of these. 159 00:06:31,540 --> 00:06:34,300 One was in this section that we call probing and scanning, 160 00:06:34,300 --> 00:06:37,900 and another was in this resource usage or resource utilization 161 00:06:37,900 --> 00:06:38,750 area. 162 00:06:38,750 --> 00:06:41,890 And specifically, we looked at port scanning and distributed 163 00:06:41,890 --> 00:06:44,170 denial of service attacks. 164 00:06:44,170 --> 00:06:47,260 As a quick example of what happens in one of these network 165 00:06:47,260 --> 00:06:50,290 attacks-- so this is an attack, often, within the bucket 166 00:06:50,290 --> 00:06:51,742 of probing and scanning. 167 00:06:51,742 --> 00:06:53,200 So essentially, what happens is you 168 00:06:53,200 --> 00:06:56,530 have an attacker that attempts to find out 169 00:06:56,530 --> 00:07:01,090 what ports are open on a victim or a target system. 170 00:07:01,090 --> 00:07:04,900 It does this by sending requests, similar to pings, 171 00:07:04,900 --> 00:07:09,010 to a victim or a target system. 172 00:07:09,010 --> 00:07:11,530 If the target's system acknowledges 173 00:07:11,530 --> 00:07:14,360 one of these things, you can say, oh, 174 00:07:14,360 --> 00:07:16,810 this particular port is open. 
175 00:07:16,810 --> 00:07:20,710 And then, one may go and look at what software typically 176 00:07:20,710 --> 00:07:22,750 runs on those ports and attempt to use 177 00:07:22,750 --> 00:07:24,520 some of the known vulnerabilities 178 00:07:24,520 --> 00:07:26,150 of that software. 179 00:07:26,150 --> 00:07:29,290 So a lot of software packages have specific ports 180 00:07:29,290 --> 00:07:30,310 that they tend to use. 181 00:07:30,310 --> 00:07:33,250 So you can say, oh, you know, port-- 182 00:07:33,250 --> 00:07:36,100 I'm making up a number, here-- port 10,000 is open, 183 00:07:36,100 --> 00:07:41,020 and I know that's used by Microsoft SQL server 184 00:07:41,020 --> 00:07:42,575 or any other piece of software. 185 00:07:42,575 --> 00:07:44,950 And then you can say, well, here is a known vulnerability 186 00:07:44,950 --> 00:07:47,870 that I could use, and then I'll try attacking with that. 187 00:07:47,870 --> 00:07:49,870 So this is just a simple technique 188 00:07:49,870 --> 00:07:52,880 that a lot of attackers use, very low bar, very easy to do. 189 00:07:52,880 --> 00:07:55,270 You could write one by mistake. 190 00:07:55,270 --> 00:07:57,430 But just to find out what ports are open, 191 00:07:57,430 --> 00:07:59,755 and then try to use known vulnerabilities 192 00:07:59,755 --> 00:08:02,200 on different ports. 193 00:08:02,200 --> 00:08:04,030 Now, the least common denominator, 194 00:08:04,030 --> 00:08:06,880 the easiest piece of data to collect, 195 00:08:06,880 --> 00:08:08,560 is what's called the network packet. 196 00:08:08,560 --> 00:08:10,840 I'm sure many of you are familiar with the concept 197 00:08:10,840 --> 00:08:12,700 of a network packet. 198 00:08:12,700 --> 00:08:15,370 For those that are not, think of if you're sending a letter, 199 00:08:15,370 --> 00:08:17,770 it's all the material on the envelope. 
200 00:08:17,770 --> 00:08:21,190 So it's the address, where it's coming from, who it's going to, 201 00:08:21,190 --> 00:08:23,530 plus a little bit of information of sort of what's 202 00:08:23,530 --> 00:08:26,200 in the package as well. 203 00:08:26,200 --> 00:08:28,968 The reason that we often use network packets-- 204 00:08:28,968 --> 00:08:30,760 and I'll kind of open this up to the class. 205 00:08:30,760 --> 00:08:32,590 A lot of the cyber research actually 206 00:08:32,590 --> 00:08:36,010 focuses on network packets, but those often 207 00:08:36,010 --> 00:08:38,890 form a very small percentage of the actual data, the payloads, 208 00:08:38,890 --> 00:08:41,500 where actually a lot of the data actually is. 209 00:08:41,500 --> 00:08:44,080 Can anyone guess why we tend to use network packets rather 210 00:08:44,080 --> 00:08:45,640 than payloads? 211 00:08:45,640 --> 00:08:46,240 Not Albert? 212 00:08:46,240 --> 00:08:47,470 AUDIENCE: Header. 213 00:08:47,470 --> 00:08:49,262 VIJAY GADEPALLY: Sorry, the network header. 214 00:08:49,262 --> 00:08:52,590 Yeah, sorry the network header, not the packet, yeah. 215 00:08:52,590 --> 00:08:53,798 So, yes. 216 00:08:53,798 --> 00:08:55,840 Anyone guess why we tend to use the header rather 217 00:08:55,840 --> 00:08:58,807 than the payload of a packet? 218 00:08:58,807 --> 00:08:59,640 AUDIENCE: Encrypted. 219 00:08:59,640 --> 00:09:01,432 VIJAY GADEPALLY: Yeah, encryption, exactly. 220 00:09:01,432 --> 00:09:07,060 So very often, especially these days, the payload of the packet 221 00:09:07,060 --> 00:09:09,580 is typically encrypted, so there's not too much 222 00:09:09,580 --> 00:09:10,880 that you can do with it. 223 00:09:10,880 --> 00:09:12,730 So using the header information-- 224 00:09:12,730 --> 00:09:13,810 the header is not encrypted, right? 225 00:09:13,810 --> 00:09:15,292 It's the outside of the envelope. 
226 00:09:15,292 --> 00:09:17,750 The inside of the envelope is something that you cannot see 227 00:09:17,750 --> 00:09:20,140 by putting it up to the light. 228 00:09:20,140 --> 00:09:23,060 So that's why we tend to use the header of the packet. 229 00:09:23,060 --> 00:09:27,320 So the headers-- so this is what a packet looks like, 230 00:09:27,320 --> 00:09:30,010 on the left. 231 00:09:30,010 --> 00:09:32,170 On the left side, that's what a packet looks like. 232 00:09:32,170 --> 00:09:34,378 And if you kind of convert that into a human readable 233 00:09:34,378 --> 00:09:36,240 form, just the header information, 234 00:09:36,240 --> 00:09:38,240 this is the type of data that you get out of it. 235 00:09:38,240 --> 00:09:40,690 So it gives you things that you would expect. 236 00:09:40,690 --> 00:09:42,880 What was the IP address that I started from? 237 00:09:42,880 --> 00:09:44,710 What was a port that I started from? 238 00:09:44,710 --> 00:09:46,870 What was the destination IP address? 239 00:09:46,870 --> 00:09:49,660 What was the destination port? 240 00:09:49,660 --> 00:09:51,070 Plus a bunch of other information 241 00:09:51,070 --> 00:09:53,350 about certain flags that were set, whether it's 242 00:09:53,350 --> 00:09:54,670 what direction to flow-- 243 00:09:54,670 --> 00:09:56,440 what direction the traffic is moving in 244 00:09:56,440 --> 00:09:59,530 and stuff like that, and then any flags associated with it, 245 00:09:59,530 --> 00:10:01,960 what type of packet it is, so what type of protocol 246 00:10:01,960 --> 00:10:04,480 is the packet using, et cetera? 247 00:10:04,480 --> 00:10:06,730 All right, so just as a reminder, the data we're using 248 00:10:06,730 --> 00:10:08,830 is from the MAWI working group, but there 249 00:10:08,830 --> 00:10:11,500 is a lot of other data out there that you might 250 00:10:11,500 --> 00:10:13,640 be able to get your hands on. 
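[EDITOR'S NOTE: The header fields just listed (source and destination IP and port, protocol, flags, and so on) sit at fixed offsets in the packet, which is why they are so easy to extract without touching the encrypted payload. Here is a minimal sketch of unpacking the fixed 20-byte IPv4 header from raw bytes; the sample packet is hand-built for illustration.]

```python
import struct
from ipaddress import IPv4Address

def parse_ipv4_header(raw: bytes) -> dict:
    """Unpack the fixed 20-byte IPv4 header: the 'envelope' fields only,
    no payload is touched."""
    ver_ihl, tos, total_len, ident, flags_frag, ttl, proto, cksum, src, dst = \
        struct.unpack("!BBHHHBBH4s4s", raw[:20])
    return {
        "version": ver_ihl >> 4,
        "header_len": (ver_ihl & 0x0F) * 4,   # in bytes
        "ttl": ttl,
        "protocol": proto,                     # 6 = TCP, 17 = UDP
        "src_ip": str(IPv4Address(src)),
        "dst_ip": str(IPv4Address(dst)),
    }

# a hand-built sample header: TCP packet from 192.0.2.1 to 198.51.100.7
sample = struct.pack("!BBHHHBBH4s4s", 0x45, 0, 40, 1, 0, 64, 6, 0,
                     bytes([192, 0, 2, 1]), bytes([198, 51, 100, 7]))
print(parse_ipv4_header(sample)["src_ip"])   # -> 192.0.2.1
```

In practice a library does this parsing for you on real pcap data, but the principle is the same: the analyst-friendly table is just these fields laid out per packet.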
251 00:10:13,640 --> 00:10:15,580 So one of the challenges-- how many people here 252 00:10:15,580 --> 00:10:17,650 have worked in cybersecurity or done 253 00:10:17,650 --> 00:10:20,122 some research in cybersecurity? 254 00:10:20,122 --> 00:10:21,790 So a handful of people. 255 00:10:21,790 --> 00:10:24,700 So one of the huge challenges-- this is something that you may 256 00:10:24,700 --> 00:10:27,820 run into in your data sets-- 257 00:10:27,820 --> 00:10:31,380 is ground truth, is understanding when-- 258 00:10:31,380 --> 00:10:33,130 So if you're trying to find out when there 259 00:10:33,130 --> 00:10:34,828 was an attack in your data, someone 260 00:10:34,828 --> 00:10:36,370 needs to have-- you know, there needs 261 00:10:36,370 --> 00:10:38,652 to be some ground truth associated with that. 262 00:10:38,652 --> 00:10:40,360 As much as we'd all love to sit and run 263 00:10:40,360 --> 00:10:42,190 DDoS attacks and port scan attacks 264 00:10:42,190 --> 00:10:45,850 on ourselves, sometimes, you know, getting that at scale 265 00:10:45,850 --> 00:10:47,300 can be very difficult. 266 00:10:47,300 --> 00:10:48,760 So for the purpose of this work, we 267 00:10:48,760 --> 00:10:51,070 used a synthetic data generator, in which 268 00:10:51,070 --> 00:10:54,040 we took real internet traffic data 269 00:10:54,040 --> 00:10:57,280 but then injected synthetic attacks into it. 270 00:10:57,280 --> 00:11:00,850 And that was sort of our way of looking for-- of establishing 271 00:11:00,850 --> 00:11:01,940 ground truth. 272 00:11:01,940 --> 00:11:03,430 So the MAWI working group actually 273 00:11:03,430 --> 00:11:07,035 does publish ground truth based on a number of detectors 274 00:11:07,035 --> 00:11:07,660 that they have. 275 00:11:07,660 --> 00:11:10,420 A lot of these are heuristics-based detectors. 
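[EDITOR'S NOTE: For a sense of what a heuristics-based detector can look like, here is a minimal sketch that flags port-scan behavior by counting distinct destination ports per source. The flow record layout and threshold are made up for illustration, not taken from the MAWI detectors.]

```python
from collections import defaultdict

def find_scanners(flows, port_threshold=100):
    """Flag (src, dst) pairs where one source touches an unusually
    large number of distinct destination ports on one host -- the
    classic footprint of a port scan.

    `flows` is a list of (src_ip, dst_ip, dst_port) tuples (hypothetical)."""
    ports_seen = defaultdict(set)
    for src, dst, dport in flows:
        ports_seen[(src, dst)].add(dport)
    return [pair for pair, ports in ports_seen.items()
            if len(ports) >= port_threshold]

# one scanner probing 500 ports, plus a little normal HTTPS traffic
flows = [("10.0.0.9", "10.0.0.1", p) for p in range(500)]
flows += [("10.0.0.2", "10.0.0.1", 443)] * 20
print(find_scanners(flows))   # -> [('10.0.0.9', '10.0.0.1')]
```

The weakness the lecture goes on to describe applies here too: a hand-tuned threshold tells you roughly *that* something scanned, but not precisely when, which is what makes such labels vague for training.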
276 00:11:10,420 --> 00:11:14,150 In our use and in our looking at that ground truth, 277 00:11:14,150 --> 00:11:16,480 it was really difficult to actually understand 278 00:11:16,480 --> 00:11:18,760 when an attack was occurring. 279 00:11:18,760 --> 00:11:20,920 It was sort of like, somewhere in this hour 280 00:11:20,920 --> 00:11:22,060 an attack occurred. 281 00:11:22,060 --> 00:11:24,643 Which, if you're trying to train a machine learning classifier 282 00:11:24,643 --> 00:11:27,430 on that or an anomaly detector, that's kind of vague. 283 00:11:27,430 --> 00:11:31,010 Because there's a lot happening in an hour of internet traffic, 284 00:11:31,010 --> 00:11:33,782 and so if it's somewhere in that hour an attack occurred, 285 00:11:33,782 --> 00:11:35,240 that can be very difficult to find. 286 00:11:35,240 --> 00:11:36,657 So that's sort of a reason that we 287 00:11:36,657 --> 00:11:38,540 focus on using a synthetic attack generator. 288 00:11:38,540 --> 00:11:39,540 Yeah? 289 00:11:39,540 --> 00:11:41,165 AUDIENCE: Can you say a little bit more 290 00:11:41,165 --> 00:11:45,640 about what kind of data gets injected? 291 00:11:45,640 --> 00:11:47,710 You know, is it just additive, where 292 00:11:47,710 --> 00:11:52,070 you're showing some new compromised 293 00:11:52,070 --> 00:11:55,950 host that is injecting data to some destination? 294 00:11:55,950 --> 00:11:58,445 Are there also, like, return flows that are in there? 295 00:11:58,445 --> 00:11:59,320 VIJAY GADEPALLY: Yep. 296 00:11:59,320 --> 00:11:59,980 AUDIENCE: OK. 297 00:11:59,980 --> 00:12:02,188 VIJAY GADEPALLY: Yeah, so the question that was asked 298 00:12:02,188 --> 00:12:05,080 is, you know, I guess, how sophisticated is this tool, 299 00:12:05,080 --> 00:12:07,220 and how detailed can you do it? 
300 00:12:07,220 --> 00:12:10,540 So the exact details of that, I would probably forward you 301 00:12:10,540 --> 00:12:15,010 to Emily, but from looking at what was being done, 302 00:12:15,010 --> 00:12:16,510 these are reasonably sophisticated. 303 00:12:16,510 --> 00:12:17,968 So it's not as simple as, you know, 304 00:12:17,968 --> 00:12:20,840 you have to give it a text file, all the IP addresses, 305 00:12:20,840 --> 00:12:22,395 source and-- 306 00:12:22,395 --> 00:12:24,020 But you do have to set some parameters. 307 00:12:24,020 --> 00:12:26,920 So what you can do, for example, if it's a DDoS attack-- right, 308 00:12:26,920 --> 00:12:29,140 that's maybe the easiest one to describe-- 309 00:12:29,140 --> 00:12:32,740 is you would give maybe a range of IP addresses and ports 310 00:12:32,740 --> 00:12:36,640 that you'd like to see injected, you'd give it a flow rate, 311 00:12:36,640 --> 00:12:38,633 and then you could tell it what type of packets 312 00:12:38,633 --> 00:12:39,550 you'd like introduced. 313 00:12:39,550 --> 00:12:43,210 And it sort of allows you to do a number of these different-- 314 00:12:43,210 --> 00:12:44,830 this particular tool at least-- 315 00:12:44,830 --> 00:12:47,380 allows you to pick a number of these different features 316 00:12:47,380 --> 00:12:51,298 that it then injects into the original pcap file. 317 00:12:51,298 --> 00:12:53,340 If you're looking for some of the other attacks-- 318 00:12:53,340 --> 00:12:55,330 so this is just a list of the attacks 319 00:12:55,330 --> 00:12:58,810 that they currently support, as of a few months ago, 320 00:12:58,810 --> 00:13:01,110 and they might be adding to this. 
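[EDITOR'S NOTE: The injection workflow just described (pick target addresses and ports, a rate, and a packet type, then mix the synthetic records into real traffic) can be sketched roughly as below. The record fields and function are hypothetical, not the actual tool's interface; the key point is that the injected records carry a label, which is exactly the ground truth the real data lacks.]

```python
import random

def inject_ddos(background, target_ip, n_sources=200, dst_port=80):
    """Mix synthetic DDoS flow records into real background traffic,
    keeping a 'label' field so ground truth is preserved.
    Field layout is made up for illustration."""
    attack = [{"src_ip": f"203.0.113.{random.randint(1, 254)}",
               "dst_ip": target_ip,          # many sources, one victim
               "dst_port": dst_port,
               "label": "ddos"}
              for _ in range(n_sources)]
    mixed = background + attack
    random.shuffle(mixed)                    # interleave with real flows
    return mixed

background = [{"src_ip": "198.51.100.5", "dst_ip": "192.0.2.9",
               "dst_port": 443, "label": "benign"}] * 1000
data = inject_ddos(background, target_ip="192.0.2.1")
print(sum(1 for r in data if r["label"] == "ddos"))   # -> 200
```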
321 00:13:01,110 --> 00:13:04,222 You know, we focused on a few of these that were also 322 00:13:04,222 --> 00:13:05,680 easy for us to reason about and, kind of, 323 00:13:05,680 --> 00:13:07,480 go back into the actual data that we 324 00:13:07,480 --> 00:13:11,590 collected to see if the actual injection made sense to us 325 00:13:11,590 --> 00:13:12,422 or not. 326 00:13:12,422 --> 00:13:14,380 But we do have a number of different parameters 327 00:13:14,380 --> 00:13:18,770 that you could pick when you're doing that. 328 00:13:18,770 --> 00:13:19,978 So that's just a little bit-- 329 00:13:19,978 --> 00:13:22,228 You know, as we talked about in the data conditioning 330 00:13:22,228 --> 00:13:23,770 phase, labeling data is often 331 00:13:23,770 --> 00:13:25,630 a pretty difficult process. 332 00:13:25,630 --> 00:13:28,150 In the world of cyber anomaly, in the world 333 00:13:28,150 --> 00:13:30,820 of network packets and network security, 334 00:13:30,820 --> 00:13:34,360 going back and trying to label is an extremely time consuming 335 00:13:34,360 --> 00:13:37,300 task and very difficult and up to a lot of interpretation 336 00:13:37,300 --> 00:13:40,270 as to what one calls an attack. 337 00:13:40,270 --> 00:13:43,810 So while we did manually inspect a few of these, 338 00:13:43,810 --> 00:13:45,260 that's certainly not scalable. 339 00:13:45,260 --> 00:13:47,630 So we decided to go with a synthetic attack generator. 340 00:13:47,630 --> 00:13:49,255 And I'm sure for some of your projects, 341 00:13:49,255 --> 00:13:51,700 you may also kind of think about is that 342 00:13:51,700 --> 00:13:56,480 a route that you need to take, at least to start off with. 343 00:13:56,480 --> 00:13:59,220 OK, so this is the overall pipeline of the system. 
344 00:13:59,220 --> 00:14:01,730 So as I mentioned the first time we chatted, 345 00:14:01,730 --> 00:14:05,780 we spent a majority of our time in the first step here, 346 00:14:05,780 --> 00:14:10,460 which is data conditioning. 347 00:14:10,460 --> 00:14:12,953 So within data conditioning, we did, you know, 348 00:14:12,953 --> 00:14:14,870 take the raw data, which is downloading this 349 00:14:14,870 --> 00:14:16,310 from the internet. 350 00:14:16,310 --> 00:14:19,340 This comes to you in a binary format, which is a pcap. 351 00:14:19,340 --> 00:14:22,060 For those who are familiar with packet capture outputs, that's 352 00:14:22,060 --> 00:14:23,540 that binary format that comes from, 353 00:14:23,540 --> 00:14:25,820 like, tcpdump, for example. 354 00:14:25,820 --> 00:14:29,080 We then convert that pcap file into flows, 355 00:14:29,080 --> 00:14:31,700 parse that into human readable form, 356 00:14:31,700 --> 00:14:35,030 feature engineer, train a machine learning classifier, 357 00:14:35,030 --> 00:14:37,310 do the same thing with the synthetic data, 358 00:14:37,310 --> 00:14:40,090 and then get good results. 359 00:14:40,090 --> 00:14:43,040 Or we hope that you get good results. 360 00:14:43,040 --> 00:14:44,960 So step one of data conditioning was 361 00:14:44,960 --> 00:14:47,420 to download the actual raw data. 362 00:14:47,420 --> 00:14:49,820 The data was downloaded in a compressed binary 363 00:14:49,820 --> 00:14:51,800 format, which is probably what a lot of you 364 00:14:51,800 --> 00:14:52,985 will get your hands on. 365 00:14:52,985 --> 00:14:57,305 A typical size of a single zip file or typical compressed file 366 00:14:57,305 --> 00:14:59,390 is about two and a half gig. 367 00:14:59,390 --> 00:15:02,060 When you decompress that, it goes to about 10 gigs, 368 00:15:02,060 --> 00:15:03,290 so about 4x. 
369 00:15:03,290 --> 00:15:06,050 And a single pcap file corresponds 370 00:15:06,050 --> 00:15:08,690 to-- so it's about 10 gigs for about 15 minutes 371 00:15:08,690 --> 00:15:10,173 worth of traffic. 372 00:15:10,173 --> 00:15:11,840 So if you're trying to do multiple days, 373 00:15:11,840 --> 00:15:15,320 you can imagine how this starts to explode in terms of volume. 374 00:15:15,320 --> 00:15:18,560 And about 15 minutes, on average, 375 00:15:18,560 --> 00:15:20,420 is about 150,000 pack-- 376 00:15:20,420 --> 00:15:23,573 it corresponds to about 150,000 packets per second. 377 00:15:23,573 --> 00:15:25,490 So this is a reasonable amount of network traffic 378 00:15:25,490 --> 00:15:27,740 that we're getting our hands on, but not even 379 00:15:27,740 --> 00:15:30,320 close to the types of volume that large scale providers 380 00:15:30,320 --> 00:15:32,540 have to deal with on a regular basis. 381 00:15:32,540 --> 00:15:34,590 Once we have the actual-- 382 00:15:34,590 --> 00:15:37,760 once we have the pcap file downloaded, 383 00:15:37,760 --> 00:15:40,100 just the network packets don't really tell us much. 384 00:15:40,100 --> 00:15:41,498 It's tough to do things with. 385 00:15:41,498 --> 00:15:43,040 So what we want to do is convert this 386 00:15:43,040 --> 00:15:46,520 into a representation that's useful to do analysis. 387 00:15:46,520 --> 00:15:49,940 And so a lot of work has been done using network flows. 388 00:15:49,940 --> 00:15:51,530 And a network flow is, essentially, 389 00:15:51,530 --> 00:15:55,250 a sequence of packets from a single source to destination. 390 00:15:55,250 --> 00:15:57,740 So as you can imagine, when you're streaming a YouTube 391 00:15:57,740 --> 00:16:01,370 video, for example, it's not one big packet that comes to you, 392 00:16:01,370 --> 00:16:03,650 but it's a series of packets. 
393 00:16:03,650 --> 00:16:06,680 And so flow essentially tries to detect that 394 00:16:06,680 --> 00:16:09,230 and say, OK, all of these things were 395 00:16:09,230 --> 00:16:10,490 a person watching this video. 396 00:16:10,490 --> 00:16:10,990 Right? 397 00:16:10,990 --> 00:16:13,580 So from this source to this destination within some time 398 00:16:13,580 --> 00:16:16,350 window is defined as a network flow. 399 00:16:16,350 --> 00:16:19,730 And so we convert each of these 15 minute pcap files 400 00:16:19,730 --> 00:16:22,580 into network flow representation using a tool 401 00:16:22,580 --> 00:16:26,460 called yaf, which stands for Yet Another Flowmeter. 402 00:16:26,460 --> 00:16:29,780 And so that helps us do some of this flow-- 403 00:16:29,780 --> 00:16:31,800 establishing these network flows. 404 00:16:31,800 --> 00:16:34,760 And so the size of about 15 minutes worth of flows 405 00:16:34,760 --> 00:16:39,470 goes to about two gigabytes, and we set a timeout of a flow 406 00:16:39,470 --> 00:16:41,990 to be about five minutes. 407 00:16:41,990 --> 00:16:43,970 OK, once we have the flows, they're 408 00:16:43,970 --> 00:16:45,740 still in this binary format. 409 00:16:45,740 --> 00:16:48,830 We need to now convert them into a representation 410 00:16:48,830 --> 00:16:49,880 that we can look at. 411 00:16:49,880 --> 00:16:53,480 So flow comes with an ASCII converter called the yafscii-- 412 00:16:53,480 --> 00:16:57,920 Sorry, yaf comes with the ASCII converter called yafscii, 413 00:16:57,920 --> 00:17:01,370 and that essentially converts the data from this flow 414 00:17:01,370 --> 00:17:03,920 representation into a tabular form. 415 00:17:03,920 --> 00:17:05,569 And if there's one lesson that I think 416 00:17:05,569 --> 00:17:07,369 Jeremy will talk to you a lot about next, 417 00:17:07,369 --> 00:17:09,109 it is tables, tables, tables. 
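[EDITOR'S NOTE: The flow-building step just described (group packets by their endpoints, and close a flow after a five-minute quiet period) can be sketched as follows. This is a simplified stand-in for what yaf does on real pcap data, with made-up packet tuples, and each finished flow is exactly the kind of row that ends up in the tabular output discussed next.]

```python
def packets_to_flows(packets, timeout=300):
    """Group packets into flows: packets sharing a
    (src, dst, sport, dport, proto) key belong to one flow, and a gap
    longer than `timeout` seconds closes the flow and starts a new one.
    Each packet is (timestamp, src, dst, sport, dport, proto)."""
    flows = {}       # key -> timestamps of the currently open flow
    last_seen = {}
    finished = []
    for ts, *key in sorted(packets):
        key = tuple(key)
        if key in flows and ts - last_seen[key] > timeout:
            finished.append((key, flows.pop(key)))   # timed out: close it
        flows.setdefault(key, []).append(ts)
        last_seen[key] = ts
    finished.extend(flows.items())                   # close remaining flows
    return finished

pkts = [(0, "a", "b", 1234, 80, "tcp"), (10, "a", "b", 1234, 80, "tcp"),
        (400, "a", "b", 1234, 80, "tcp")]   # 390 s gap -> second flow
print(len(packets_to_flows(pkts)))   # -> 2
```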
418 00:17:09,109 --> 00:17:11,220 People love looking at tabular data, 419 00:17:11,220 --> 00:17:13,010 it's very easy to understand. 420 00:17:13,010 --> 00:17:15,960 I wouldn't recommend it, but you could open this up in Excel. 421 00:17:15,960 --> 00:17:19,069 It would be huge, but it would open up. 422 00:17:19,069 --> 00:17:21,361 But it's very easy to, kind of, look at, 423 00:17:21,361 --> 00:17:23,819 especially when we're trying to establish truth in the data 424 00:17:23,819 --> 00:17:25,700 and go back and take a look at it. 425 00:17:25,700 --> 00:17:28,930 So for our pipeline, we take the yaf files, 426 00:17:28,930 --> 00:17:30,548 convert them into text files. 427 00:17:30,548 --> 00:17:32,090 Each of these ASCII tables, so again, 428 00:17:32,090 --> 00:17:34,430 about 15 minutes worth of data, is about eight gigabytes 429 00:17:34,430 --> 00:17:36,050 in size. 430 00:17:36,050 --> 00:17:39,420 And we record a number of different data pieces per flow. 431 00:17:39,420 --> 00:17:41,670 I won't go into the details, but if you're interested, 432 00:17:41,670 --> 00:17:45,800 you can kind of look at what are the various entities 433 00:17:45,800 --> 00:17:49,640 or features that come up in a network flow? 434 00:17:49,640 --> 00:17:51,830 So now that we have network flows, 435 00:17:51,830 --> 00:17:55,550 that's sort of like part one of data conditioning, right? 436 00:17:55,550 --> 00:17:57,410 We've converted it into some representation 437 00:17:57,410 --> 00:17:58,370 that makes sense. 438 00:17:58,370 --> 00:18:00,320 We have it in human readable form. 439 00:18:00,320 --> 00:18:02,300 So, step one data conditioning done. 440 00:18:02,300 --> 00:18:03,800 Now, the next thing is to convert it 441 00:18:03,800 --> 00:18:05,540 into something that actually makes sense 442 00:18:05,540 --> 00:18:07,470 for our machine learning algorithm. 
443 00:18:07,470 --> 00:18:10,100 And so this sort of buckets into the area of feature 444 00:18:10,100 --> 00:18:12,620 engineering, when we think back on it. 445 00:18:12,620 --> 00:18:16,470 And so feature engineering is really, really important. 446 00:18:16,470 --> 00:18:19,560 You'll spend a lot of time doing this on new data sets. 447 00:18:19,560 --> 00:18:21,230 And so, each of these flows contains 448 00:18:21,230 --> 00:18:23,220 about 21 different features. 449 00:18:23,220 --> 00:18:24,260 And so-- 450 00:18:24,260 --> 00:18:26,690 we need to figure out essentially which of these features 451 00:18:26,690 --> 00:18:29,180 make sense for us to pass into a machine learning model. 452 00:18:29,180 --> 00:18:31,680 Otherwise, we're just going to end up training on noise. 453 00:18:31,680 --> 00:18:33,710 So we have to use some domain knowledge, right, 454 00:18:33,710 --> 00:18:36,110 where we talk to experts in cybersecurity 455 00:18:36,110 --> 00:18:38,420 and say, well, you know, IP address is kind 456 00:18:38,420 --> 00:18:40,910 of an important thing, but maybe this other flag is not 457 00:18:40,910 --> 00:18:41,695 as important. 458 00:18:41,695 --> 00:18:43,820 And then, we use a lot of trial and error and luck, 459 00:18:43,820 --> 00:18:46,860 as well, to help pick some of these features. 460 00:18:46,860 --> 00:18:48,297 And once we've done that-- 461 00:18:48,297 --> 00:18:49,880 once we pick a set of features 462 00:18:49,880 --> 00:18:52,070 that we're interested in, when we look back 463 00:18:52,070 --> 00:18:54,530 at the anomaly detection literature, a lot of work 464 00:18:54,530 --> 00:18:57,135 has been done with this concept of, 465 00:18:57,135 --> 00:18:58,760 you know, you have these flows, but you 466 00:18:58,760 --> 00:19:02,750 need to convert them into some other format that makes sense 467 00:19:02,750 --> 00:19:03,830 to look for anomalies.
468 00:19:03,830 --> 00:19:05,750 So it's just converting into a feature space 469 00:19:05,750 --> 00:19:07,710 in which anomalies make sense. 470 00:19:07,710 --> 00:19:10,490 And so we looked at using entropy. 471 00:19:10,490 --> 00:19:12,890 And the basic intuition behind this 472 00:19:12,890 --> 00:19:17,030 is that if we're looking at the distributions 473 00:19:17,030 --> 00:19:20,960 of the different flow fields, the entropy of those fields 474 00:19:20,960 --> 00:19:22,940 should be roughly similar over time-- 475 00:19:22,940 --> 00:19:24,440 not exactly equal, but 476 00:19:24,440 --> 00:19:26,270 about the same-- and should 477 00:19:26,270 --> 00:19:30,890 be unchanging when there is no big mechanism at work-- 478 00:19:30,890 --> 00:19:33,020 which is what an outlier detection mechanism looks for-- 479 00:19:33,020 --> 00:19:35,130 that's changing the entropy. 480 00:19:35,130 --> 00:19:38,360 So for example, just to make this more clear, 481 00:19:38,360 --> 00:19:40,640 if you're having a DDoS attack, this 482 00:19:40,640 --> 00:19:43,790 is typically a number of different source IP 483 00:19:43,790 --> 00:19:47,120 addresses attempting to talk to a single destination IP 484 00:19:47,120 --> 00:19:48,000 address. 485 00:19:48,000 --> 00:19:50,150 So you would expect to see an increase 486 00:19:50,150 --> 00:19:53,660 in the entropy of the source IP field 487 00:19:53,660 --> 00:19:55,127 in that particular example. 488 00:19:55,127 --> 00:19:56,960 And in port scan attacks, you would probably 489 00:19:56,960 --> 00:20:00,363 see the destination port entropy go up. 490 00:20:00,363 --> 00:20:02,030 Right, that's a little bit of intuition, 491 00:20:02,030 --> 00:20:03,905 and it's a little bit more complicated than that. 492 00:20:03,905 --> 00:20:06,330 But that's a very high level view of what's going on.
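[Editor's note] The entropy feature described here can be sketched in a few lines. This is the standard Shannon entropy over the empirical distribution of a field's values; the sample traffic below is made up purely for illustration of the DDoS intuition (many distinct source IPs drive the source IP entropy up).

```python
import math
from collections import Counter

def shannon_entropy(values):
    """Shannon entropy, in bits, of the empirical distribution of values:
    H = -sum over v of p(v) * log2 p(v)."""
    counts = Counter(values)
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

# Normal traffic: a handful of clients talking repeatedly.
normal_srcs = ["10.0.0.1", "10.0.0.2", "10.0.0.3"] * 100

# DDoS-like traffic: many distinct sources hitting one target,
# so the source IP field's entropy jumps.
ddos_srcs = [f"203.0.113.{i}" for i in range(256)]
```

Here `shannon_entropy(normal_srcs)` is about 1.6 bits, while `shannon_entropy(ddos_srcs)` is 8 bits, which is the kind of shift the outlier detection looks for.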
493 00:20:06,330 --> 00:20:08,630 And for entropy, we just use the standard-- 494 00:20:08,630 --> 00:20:10,880 I'm sure many of you who are familiar with information 495 00:20:10,880 --> 00:20:13,640 theory are very familiar with Shannon entropy. 496 00:20:13,640 --> 00:20:15,660 And we compute the entropy associated 497 00:20:15,660 --> 00:20:19,200 with each of the features that we picked from the network flow 498 00:20:19,200 --> 00:20:19,700 before. 499 00:20:19,700 --> 00:20:21,158 So from the feature engineering, we 500 00:20:21,158 --> 00:20:23,540 picked a subset of features associated with that. 501 00:20:23,540 --> 00:20:25,625 We all remember what neural networks are. 502 00:20:25,625 --> 00:20:28,140 This is just there for completeness. 503 00:20:28,140 --> 00:20:30,050 So we now take these 504 00:20:30,050 --> 00:20:34,010 entropies associated with the network flows, and 505 00:20:34,010 --> 00:20:38,900 we use those as input features for a neural network model. 506 00:20:38,900 --> 00:20:41,840 Now, you'll recall from the neural network talk 507 00:20:41,840 --> 00:20:46,010 that we had last week, we have inputs and outputs, 508 00:20:46,010 --> 00:20:47,825 we have weights associated with them, 509 00:20:47,825 --> 00:20:49,700 and then we have this nonlinear equation that 510 00:20:49,700 --> 00:20:53,598 sort of represents inputs and outputs at each of the layers. 511 00:20:53,598 --> 00:20:54,515 That was the equation. 512 00:20:54,515 --> 00:20:56,810 I could see Barry giving me the cue that I'm 513 00:20:56,810 --> 00:20:57,935 walking towards the slides. 514 00:20:57,935 --> 00:20:58,602 AUDIENCE: Oh no. 515 00:21:00,740 --> 00:21:05,030 VIJAY GADEPALLY: So the features that we have over here 516 00:21:05,030 --> 00:21:07,290 correspond to the various entropies 517 00:21:07,290 --> 00:21:09,050 that we've computed in the previous step.
518 00:21:09,050 --> 00:21:11,210 And the outputs of this network correspond 519 00:21:11,210 --> 00:21:13,190 to a class of attack. 520 00:21:13,190 --> 00:21:15,830 So in particular, we focused on, obviously, 521 00:21:15,830 --> 00:21:19,790 no attack, DDoS attacks, port scans, network scans, 522 00:21:19,790 --> 00:21:22,430 and P2P network scans. 523 00:21:22,430 --> 00:21:26,440 The model itself had three hidden layers with 130 524 00:21:26,440 --> 00:21:29,870 and 100 nodes, respectively, and an output layer. 525 00:21:29,870 --> 00:21:31,760 And we used ReLU activation. 526 00:21:31,760 --> 00:21:34,400 I'm happy to talk in more depth about why we selected these, 527 00:21:34,400 --> 00:21:37,310 but I'll save that for another time. 528 00:21:37,310 --> 00:21:39,710 And going through this process, we actually 529 00:21:39,710 --> 00:21:41,430 had very good results. 530 00:21:41,430 --> 00:21:45,660 So that's sort of the short form of the research work. 531 00:21:45,660 --> 00:21:47,420 But what I wanted to really emphasize 532 00:21:47,420 --> 00:21:50,060 is the amount of effort that went into the data 533 00:21:50,060 --> 00:21:53,690 conditioning, the importance of collecting data, anonymizing 534 00:21:53,690 --> 00:21:57,650 data, making it human readable, converting it into a format 535 00:21:57,650 --> 00:22:01,060 that people understand, and then kind of walking 536 00:22:01,060 --> 00:22:03,560 through the sort of-- it was certainly an iterative process. 537 00:22:03,560 --> 00:22:06,200 As much as this is a very short form of what we did, 538 00:22:06,200 --> 00:22:09,050 this took a long time to get to these results. 539 00:22:09,050 --> 00:22:10,700 And what we're presenting over here 540 00:22:10,700 --> 00:22:12,940 is the attack type and the intensity. 
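[Editor's note] A minimal forward pass for a model of this general shape can be sketched as below. The layer sizes are illustrative only (the transcript's "130 and 100 nodes" for three hidden layers is ambiguous), and the weights here are random and untrained, so this shows the structure, not the actual trained classifier.

```python
import math
import random

random.seed(0)

def dense(x, W, b):
    # One layer: y_j = sum_i x_i * W[i][j] + b[j]
    return [sum(x[i] * W[i][j] for i in range(len(x))) + b[j]
            for j in range(len(b))]

def relu(v):
    # ReLU activation, as used in the talk
    return [max(0.0, z) for z in v]

def softmax(v):
    # Turn the output layer into class probabilities
    m = max(v)
    e = [math.exp(z - m) for z in v]
    s = sum(e)
    return [z / s for z in e]

def init_layer(n_in, n_out):
    W = [[random.gauss(0.0, 0.1) for _ in range(n_out)] for _ in range(n_in)]
    return W, [0.0] * n_out

# Illustrative shape: entropy features in, three hidden ReLU layers,
# five output classes (no attack, DDoS, port scan, network scan, P2P scan).
SIZES = [8, 30, 30, 30, 5]
LAYERS = [init_layer(a, b) for a, b in zip(SIZES, SIZES[1:])]

def classify(features):
    x = features
    for W, b in LAYERS[:-1]:
        x = relu(dense(x, W, b))
    W, b = LAYERS[-1]
    return softmax(dense(x, W, b))  # probabilities over the five classes
```

Given an eight-element entropy feature vector, `classify` returns five non-negative probabilities summing to one.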
541 00:22:12,940 --> 00:22:16,100 So this is the ground truth from the synthetic data generator, 542 00:22:16,100 --> 00:22:18,440 and then this is sort of the prediction 543 00:22:18,440 --> 00:22:20,478 that our model has on this side. 544 00:22:20,478 --> 00:22:21,770 And this is a confusion matrix. 545 00:22:21,770 --> 00:22:24,530 So a higher number or a darker blue 546 00:22:24,530 --> 00:22:26,700 indicates that we did a really good job. 547 00:22:26,700 --> 00:22:29,540 And we've also used varying intensities 548 00:22:29,540 --> 00:22:30,560 of these attacks. 549 00:22:30,560 --> 00:22:32,600 So certainly, if we have strong, which 550 00:22:32,600 --> 00:22:35,540 means we've injected about 20,000 packets in the 15 551 00:22:35,540 --> 00:22:39,950 minutes, we can detect those really, really well. 552 00:22:39,950 --> 00:22:42,500 As you have weaker and weaker attacks, 553 00:22:42,500 --> 00:22:46,460 our system doesn't do as well. It's still quite reasonable, 554 00:22:46,460 --> 00:22:50,430 but certainly an area of ongoing research. 555 00:22:50,430 --> 00:22:53,750 So with that, I just wanted to hand it over to Jeremy. 556 00:22:53,750 --> 00:22:56,690 But these are some initial work and results 557 00:22:56,690 --> 00:22:58,640 of detecting and classifying network attacks. 558 00:22:58,640 --> 00:23:00,830 The key point was-- data conditioning 559 00:23:00,830 --> 00:23:03,890 was where we ended up spending the majority of the time. 560 00:23:03,890 --> 00:23:06,560 Cleaning up the collected data, coming up with-- 561 00:23:06,560 --> 00:23:09,290 we spent a lot of time trying to figure out 562 00:23:09,290 --> 00:23:12,170 how to label this data because we were just 563 00:23:12,170 --> 00:23:13,400 unable to really do it. 564 00:23:13,400 --> 00:23:16,670 And finally, we ended up going the synthetic generator path.
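[Editor's note] The confusion matrix described above is simple to compute once you have ground-truth and predicted labels. The class names below follow the attack types named in the talk; the tiny example labels are made up.

```python
# Attack classes from the talk: no attack, DDoS, port scan,
# network scan, and P2P network scan.
ATTACK_CLASSES = ["none", "ddos", "port_scan", "network_scan", "p2p_scan"]

def confusion_matrix(truth, predicted, classes=ATTACK_CLASSES):
    """Rows are ground truth, columns are the model's prediction;
    a strong diagonal (the dark blue cells on the slide) means
    the model did well."""
    idx = {c: i for i, c in enumerate(classes)}
    m = [[0] * len(classes) for _ in classes]
    for t, p in zip(truth, predicted):
        m[idx[t]][idx[p]] += 1
    return m
```

For example, a DDoS window the model calls "none" increments the off-diagonal cell in the DDoS row, which is exactly the weak-attack failure mode discussed above.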
565 00:23:16,670 --> 00:23:19,350 But we actually did spend a significant amount of time 566 00:23:19,350 --> 00:23:20,640 to try to figure that out. 567 00:23:20,640 --> 00:23:22,130 And then, we spent a lot of time on feature 568 00:23:22,130 --> 00:23:23,600 engineering, which did consist 569 00:23:23,600 --> 00:23:26,920 of a lot of trial and error and a lot of stuff like that. 570 00:23:26,920 --> 00:23:29,930 OK, thank you. 571 00:23:29,930 --> 00:23:34,210 JEREMY KEPNER: OK, thank you very much. 572 00:23:34,210 --> 00:23:37,040 I'm Jeremy Kepner from the Lincoln Laboratory 573 00:23:37,040 --> 00:23:38,400 Supercomputing Center. 574 00:23:38,400 --> 00:23:41,140 I think most of you know me already. 575 00:23:41,140 --> 00:23:45,350 I'm just really introducing myself for the video. 576 00:23:45,350 --> 00:23:49,790 And this topic has been teed up really 577 00:23:49,790 --> 00:23:52,340 well by Sid and Vijay. 578 00:23:52,340 --> 00:23:56,540 I think they illustrated, sort of, going from-- 579 00:23:56,540 --> 00:24:04,590 if we think about our pipeline going from left to right. 580 00:24:04,590 --> 00:24:06,440 So they've talked about, kind of, 581 00:24:06,440 --> 00:24:08,480 the end stage of what you're trying to do 582 00:24:08,480 --> 00:24:10,280 and building these models. 583 00:24:10,280 --> 00:24:13,820 Vijay talked about the data that feeds into that. 584 00:24:13,820 --> 00:24:16,010 I'm going to talk about the data architecture 585 00:24:16,010 --> 00:24:19,200 that you go through to kind of build up to that. 586 00:24:19,200 --> 00:24:22,130 And the reason we've done it in this order, which 587 00:24:22,130 --> 00:24:27,680 is the reverse order that you do it in real life, 588 00:24:27,680 --> 00:24:30,710 is because it's difficult to appreciate the data 589 00:24:30,710 --> 00:24:33,950 architecture piece without an understanding of where you're 590 00:24:33,950 --> 00:24:35,360 going.
591 00:24:35,360 --> 00:24:39,440 And so, as you saw from Sid's talk, 592 00:24:39,440 --> 00:24:43,460 I think it became very clear that having lots of data 593 00:24:43,460 --> 00:24:46,640 is really important for creating these models. 594 00:24:46,640 --> 00:24:51,470 Without the data, you can't do anything. 595 00:24:51,470 --> 00:24:55,040 And likewise, you didn't see it, but all that data 596 00:24:55,040 --> 00:24:58,640 was created in a very precise form that made it 597 00:24:58,640 --> 00:25:02,300 easy to do that kind of work. 598 00:25:02,300 --> 00:25:04,640 And then, Vijay talked a little about some of these things: 599 00:25:04,640 --> 00:25:06,890 given a data set, what he had 600 00:25:06,890 --> 00:25:13,550 to do to get it into a form such that it would work for his AI 601 00:25:13,550 --> 00:25:15,500 training application. 602 00:25:15,500 --> 00:25:19,820 And what Vijay is doing and did in that application 603 00:25:19,820 --> 00:25:23,450 is very much a sort of microcosm of what many of you 604 00:25:23,450 --> 00:25:25,640 will be doing in your own projects. 605 00:25:25,640 --> 00:25:27,740 You will try to identify data sets, 606 00:25:27,740 --> 00:25:30,230 you will try and do feature engineering, 607 00:25:30,230 --> 00:25:32,390 and then go through this sort of trial and error 608 00:25:32,390 --> 00:25:37,190 process of coming up with AI models that suit some purpose. 609 00:25:37,190 --> 00:25:41,330 And a key part of that is what we call AI data architecture. 610 00:25:41,330 --> 00:25:44,450 So I'm going to talk about some of the things 611 00:25:44,450 --> 00:25:48,230 that we've learned over the last 10 years in data architecture.
612 00:25:48,230 --> 00:25:50,420 As the head of the supercomputing center, 613 00:25:50,420 --> 00:25:53,120 I have the privilege of working with, really, 614 00:25:53,120 --> 00:25:57,410 everyone at MIT on all their data analysis problems, 615 00:25:57,410 --> 00:26:01,070 and over the years, we've learned some techniques, 616 00:26:01,070 --> 00:26:05,000 some really simple techniques, that can 617 00:26:05,000 --> 00:26:08,900 make data architecture for your applications dramatically simpler. 618 00:26:08,900 --> 00:26:10,910 There are a lot of people out there 619 00:26:10,910 --> 00:26:14,870 selling very fancy software to solve data architecture 620 00:26:14,870 --> 00:26:16,220 problems. 621 00:26:16,220 --> 00:26:20,000 We have found that we can use very simple techniques 622 00:26:20,000 --> 00:26:22,970 to solve most of those problems using software 623 00:26:22,970 --> 00:26:26,610 that's built into every single computer that is made today. 624 00:26:26,610 --> 00:26:29,240 That is, you don't need to go and download 625 00:26:29,240 --> 00:26:31,520 some new tool to do any of the things I'm 626 00:26:31,520 --> 00:26:34,200 going to talk about right now. 627 00:26:34,200 --> 00:26:36,810 So with that introduction, I'm going to just go over 628 00:26:36,810 --> 00:26:38,570 the outline of my talk. 629 00:26:38,570 --> 00:26:41,120 I'm going to start with some introduction and motivation, 630 00:26:41,120 --> 00:26:43,370 which I'll go through quickly because I think 631 00:26:43,370 --> 00:26:46,280 Sid and Vijay did a fantastic job of motivating what we're 632 00:26:46,280 --> 00:26:47,000 doing. 633 00:26:47,000 --> 00:26:51,380 And then, I'm going to talk about our three big ideas 634 00:26:51,380 --> 00:26:54,710 for how to make AI data architecture 635 00:26:54,710 --> 00:26:59,250 simple and effective across a wide range of applications.
636 00:26:59,250 --> 00:27:02,360 One is, as Vijay talked about, 637 00:27:02,360 --> 00:27:06,350 we're going to really hit this concept of tabular data. 638 00:27:06,350 --> 00:27:10,280 There's a reason why tabular data is so good 639 00:27:10,280 --> 00:27:12,950 and why it is such a preferred format. 640 00:27:12,950 --> 00:27:15,260 And so, you want to try and get your data 641 00:27:15,260 --> 00:27:17,280 into these tabular formats. 642 00:27:17,280 --> 00:27:21,150 And the nice thing is that, if that sounds really simple, 643 00:27:21,150 --> 00:27:24,380 it's because it is really simple. 644 00:27:24,380 --> 00:27:26,120 Lots of people, though, want you to buy 645 00:27:26,120 --> 00:27:28,610 tools that make this process complicated 646 00:27:28,610 --> 00:27:31,250 because that's how they inject themselves in the process. 647 00:27:31,250 --> 00:27:32,900 And we've discovered, actually, you 648 00:27:32,900 --> 00:27:37,250 can do world-class, MIT-quality AI work with really, really 649 00:27:37,250 --> 00:27:38,600 simple tools. 650 00:27:38,600 --> 00:27:41,990 Getting things in a tabular form is a great starting point. 651 00:27:41,990 --> 00:27:44,760 And then, how to organize the data, 652 00:27:44,760 --> 00:27:48,140 how to organize the data in the pipeline that Vijay showed, 653 00:27:48,140 --> 00:27:51,200 which is pretty standard across a lot of applications. 654 00:27:51,200 --> 00:27:53,270 Again, there are lots of applications and tools 655 00:27:53,270 --> 00:27:54,830 that will do that for you. 656 00:27:54,830 --> 00:27:58,940 We've just found that simple folder naming and file naming 657 00:27:58,940 --> 00:28:01,910 schemes get you a long way.
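[Editor's note] A hypothetical naming scheme in the spirit of what's described here: the pipeline stage, capture date, and time window live in the folder and file names themselves, so anyone can tell what a file is without any extra tooling. The stage names and extensions below are made up for illustration.

```python
from pathlib import Path

# Stage folders are numbered so they sort in pipeline order; the
# extensions (pcap, yaf, tsv) echo the formats from the earlier
# example, but the self-describing scheme is the point, not the
# specific names.
STAGE_EXT = {"01-raw": "pcap", "02-flows": "yaf", "03-tables": "tsv"}

def stage_path(root, stage, date, window):
    """Build a path like data/02-flows/20190401/20190401-1400.yaf,
    where `date` is YYYYMMDD and `window` is the capture window HHMM."""
    return Path(root) / stage / date / f"{date}-{window}.{STAGE_EXT[stage]}"
```

With this convention, a plain directory listing already tells you which stage produced a file and which 15-minute window it covers, which is exactly the no-tools property being advocated.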
658 00:28:01,910 --> 00:28:04,850 They require no tools, and they make 659 00:28:04,850 --> 00:28:07,755 it really easy for other people to join in your project 660 00:28:07,755 --> 00:28:09,380 because they can just look at the data, 661 00:28:09,380 --> 00:28:11,550 and it's pretty clear what's going on. 662 00:28:11,550 --> 00:28:14,090 And it also makes the data easy to share. 663 00:28:14,090 --> 00:28:16,550 So that's another simple thing. 664 00:28:16,550 --> 00:28:20,690 And then finally, I want to talk about how every single application 665 00:28:20,690 --> 00:28:23,630 we've demoed here involved data sets that 666 00:28:23,630 --> 00:28:26,900 were designed for sharing. 667 00:28:26,900 --> 00:28:30,250 Designing your data such that you can give it to others 668 00:28:30,250 --> 00:28:33,100 actually addresses a lot of the data architecture 669 00:28:33,100 --> 00:28:37,000 problems that you will encounter yourself. 670 00:28:37,000 --> 00:28:40,090 By thinking about, OK, how do I make this data such that I 671 00:28:40,090 --> 00:28:41,560 can give it to someone else, you're 672 00:28:41,560 --> 00:28:44,458 really making a much better product for yourself. 673 00:28:44,458 --> 00:28:46,000 You're making it so that other people 674 00:28:46,000 --> 00:28:47,770 can join your team much more quickly 675 00:28:47,770 --> 00:28:50,380 and be much more effective. 676 00:28:50,380 --> 00:28:52,690 And so, thinking about sharing of data 677 00:28:52,690 --> 00:28:54,620 is really an important thing. 678 00:28:54,620 --> 00:28:58,660 And what you'll discover is that generalized sharing of data 679 00:28:58,660 --> 00:28:59,800 is very hard. 680 00:28:59,800 --> 00:29:01,840 Just pointing to a data set and saying I 681 00:29:01,840 --> 00:29:04,190 want to share it is very challenging. 682 00:29:04,190 --> 00:29:06,140 There are a lot of barriers to doing that.
683 00:29:06,140 --> 00:29:09,130 But if you have some idea what the end purpose is, 684 00:29:09,130 --> 00:29:11,050 what the particular application that you want 685 00:29:11,050 --> 00:29:14,380 to share this data for, then it becomes a lot easier, 686 00:29:14,380 --> 00:29:16,643 both technically and administratively. 687 00:29:16,643 --> 00:29:18,310 And we will talk a little bit about some 688 00:29:18,310 --> 00:29:20,102 of the administrative burdens that you need 689 00:29:20,102 --> 00:29:21,370 to go through to share data. 690 00:29:21,370 --> 00:29:24,820 And this is where knowing the purpose, in this case, that we want 691 00:29:24,820 --> 00:29:28,750 to do a particular AI application, dramatically 692 00:29:28,750 --> 00:29:34,270 simplifies the process of doing data sharing. 693 00:29:34,270 --> 00:29:37,180 So just some motivation, this is a slide 694 00:29:37,180 --> 00:29:39,610 that Vijay presented earlier. 695 00:29:39,610 --> 00:29:41,470 And the big thing to emphasize here 696 00:29:41,470 --> 00:29:43,270 is we have a bunch of breakthroughs, 697 00:29:43,270 --> 00:29:46,720 and the key is the bottom row, which 698 00:29:46,720 --> 00:29:51,460 says that the time from algorithm to breakthrough 699 00:29:51,460 --> 00:29:54,370 is, like, two decades, but the time 700 00:29:54,370 --> 00:29:58,630 from getting a shareable, well-architected 701 00:29:58,630 --> 00:30:01,480 data set to breakthrough was much, much shorter-- 702 00:30:01,480 --> 00:30:03,220 only three years. 703 00:30:03,220 --> 00:30:07,060 And so that's, kind of, motivation for 704 00:30:07,060 --> 00:30:13,000 why data is so important to making progress in AI. 705 00:30:13,000 --> 00:30:15,670 The data is actually writing the program, 706 00:30:15,670 --> 00:30:17,840 as Vijay mentioned in the first lecture. 707 00:30:17,840 --> 00:30:20,020 And so having data is really important.
708 00:30:20,020 --> 00:30:22,630 And it has been often observed that 80% 709 00:30:22,630 --> 00:30:26,740 of the effort associated with building an AI system 710 00:30:26,740 --> 00:30:30,460 is data wrangling, or data architecture, as we call it. 711 00:30:30,460 --> 00:30:35,890 And one of the benefits of these challenge problems, 712 00:30:35,890 --> 00:30:37,870 or these open and shareable data sets, 713 00:30:37,870 --> 00:30:40,750 is that they have done a lot of the data architecture and data 714 00:30:40,750 --> 00:30:43,540 wrangling for the community, which 715 00:30:43,540 --> 00:30:46,630 is why they're able to have a lot of people work on it 716 00:30:46,630 --> 00:30:48,550 and make progress on it. 717 00:30:48,550 --> 00:30:50,360 If you're doing a new application, 718 00:30:50,360 --> 00:30:52,810 you have to do this data architecture and data 719 00:30:52,810 --> 00:30:54,640 wrangling yourself. 720 00:30:54,640 --> 00:30:56,830 And we're going to share with you the techniques 721 00:30:56,830 --> 00:30:57,955 for doing that effectively. 722 00:30:57,955 --> 00:31:00,460 And I would say, in most situations, 723 00:31:00,460 --> 00:31:03,430 this 80% of the effort actually is 724 00:31:03,430 --> 00:31:05,830 going to discovering the methods that we are going 725 00:31:05,830 --> 00:31:08,020 to cover in a few minutes. 726 00:31:08,020 --> 00:31:10,850 That is, it takes people a long time to figure out, 727 00:31:10,850 --> 00:31:14,470 oh simple tables are good. 728 00:31:14,470 --> 00:31:16,510 We can organize our data that way. 729 00:31:16,510 --> 00:31:20,290 And just having a good set of folder names and file names 730 00:31:20,290 --> 00:31:24,423 really solves most of the problem, right? 731 00:31:24,423 --> 00:31:25,840 That's the beauty of the solution. 732 00:31:25,840 --> 00:31:28,210 It sounds so phenomenally simple. 733 00:31:28,210 --> 00:31:30,670 How could it possibly work? 
734 00:31:30,670 --> 00:31:32,440 That's actually-- you know, if there's 735 00:31:32,440 --> 00:31:33,940 one thing we've figured out here at MIT, 736 00:31:33,940 --> 00:31:38,020 it's that simple can be very good and very powerful. 737 00:31:38,020 --> 00:31:42,520 So as was mentioned before, these neural networks, 738 00:31:42,520 --> 00:31:47,770 as depicted here, sort of drive this space of applications. 739 00:31:47,770 --> 00:31:50,650 Most of the innovations we've seen in the past few years 740 00:31:50,650 --> 00:31:53,290 have been using these neural networks. 741 00:31:53,290 --> 00:31:56,290 This one slide sort of covers all of neural networks. 742 00:31:56,290 --> 00:31:58,780 In one slide, you have the picture, 743 00:31:58,780 --> 00:32:01,690 you have the input features, which are shown in gray dots, 744 00:32:01,690 --> 00:32:03,520 you have the output classifications 745 00:32:03,520 --> 00:32:05,248 that are shown in blue dots, and then 746 00:32:05,248 --> 00:32:06,790 you have the neural network and all 747 00:32:06,790 --> 00:32:13,000 its various weight matrices or weight tables in between. 748 00:32:13,000 --> 00:32:18,040 If you had one goal, in terms of understanding AI, 749 00:32:18,040 --> 00:32:22,070 it would be making a commitment to understanding this one slide-- 750 00:32:22,070 --> 00:32:24,610 then you pretty much understand a good fraction 751 00:32:24,610 --> 00:32:27,190 of the entire field. 752 00:32:27,190 --> 00:32:30,490 And while you are here at MIT, please 753 00:32:30,490 --> 00:32:35,470 feel free to talk to any of us about this slide 754 00:32:35,470 --> 00:32:39,460 or these concepts and we will work with you 755 00:32:39,460 --> 00:32:41,470 to help you understand it. 756 00:32:41,470 --> 00:32:43,630 Because anyone who ever comes to you 757 00:32:43,630 --> 00:32:46,690 offering an AI solution that uses neural networks, which 758 00:32:46,690 --> 00:32:51,280 is a significant fraction of them, they're just doing this.
759 00:32:51,280 --> 00:32:52,870 This is all they're doing. 760 00:32:52,870 --> 00:32:54,850 It's all on this one slide. 761 00:32:54,850 --> 00:32:59,650 Many of them, by the way, do not know the math 762 00:32:59,650 --> 00:33:02,110 that's in this one equation here. 763 00:33:02,110 --> 00:33:04,000 They have software, they play with knobs, 764 00:33:04,000 --> 00:33:07,150 but they don't even understand the underlying mathematics. 765 00:33:07,150 --> 00:33:10,720 And just understanding this math, taking the time to do it, 766 00:33:10,720 --> 00:33:14,590 you will know more than anyone who is actually trying 767 00:33:14,590 --> 00:33:18,890 to sell you AI solutions. 768 00:33:18,890 --> 00:33:23,690 Now, as Sid said, he talked about batch size-- 769 00:33:23,690 --> 00:33:26,690 I just added batches to this figure 770 00:33:26,690 --> 00:33:31,160 to emphasize the fact that all of these letters 771 00:33:31,160 --> 00:33:33,970 are mostly uppercase. 772 00:33:33,970 --> 00:33:37,850 And if you want to know one thing about matrix math, 773 00:33:37,850 --> 00:33:41,330 it is that the easiest way to go from a vector equation 774 00:33:41,330 --> 00:33:44,990 to a matrix equation is to take all the lowercase letters 775 00:33:44,990 --> 00:33:46,580 and make them uppercase. 776 00:33:46,580 --> 00:33:49,790 And most of the time, that actually just works. 777 00:33:49,790 --> 00:33:51,650 In fact, typically what we do, if you're 778 00:33:51,650 --> 00:33:54,500 a linear algebra person, is just that, 779 00:33:54,500 --> 00:33:56,155 and see if it breaks anything. 780 00:33:56,155 --> 00:33:57,530 And if it doesn't break anything, 781 00:33:57,530 --> 00:33:58,970 well, we assume it's right. 782 00:33:58,970 --> 00:34:02,600 And all of these matrices, right, are tabular. 783 00:34:02,600 --> 00:34:04,490 A matrix is just a big table. 784 00:34:04,490 --> 00:34:06,530 It's a great big spreadsheet.
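[Editor's note] To make the lowercase-to-uppercase point concrete, here is a tiny pure-Python check (with made-up weights) that the batched matrix form Y = relu(XW + B) gives the same answer as applying the per-sample vector form y = relu(xW + b) one row at a time.

```python
W = [[1.0, -1.0],
     [0.5, 2.0]]      # weight table: 2 inputs -> 2 outputs
b = [0.1, -0.2]       # bias vector

def forward_one(x):
    # Vector form: y = relu(x W + b) for a single sample x
    cols = list(zip(*W))
    return [max(0.0, sum(xi * wi for xi, wi in zip(x, col)) + bj)
            for col, bj in zip(cols, b)]

def forward_batch(X):
    # Matrix form: Y = relu(X W + B), samples stacked as the rows of X,
    # with the bias row b implicitly repeated for every sample (that's B)
    cols = list(zip(*W))
    return [[max(0.0, sum(xi * wi for xi, wi in zip(row, col)) + bj)
             for col, bj in zip(cols, b)]
            for row in X]

X = [[1.0, 2.0], [3.0, -1.0]]  # a batch of two samples
```

The batched version performs exactly the same arithmetic per row, which is why "uppercasing" the equation usually just works.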
785 00:34:06,530 --> 00:34:10,550 So basically, all of neural networks in AI 786 00:34:10,550 --> 00:34:13,489 is just about taking spreadsheets and transforming 787 00:34:13,489 --> 00:34:14,699 them. 788 00:34:14,699 --> 00:34:16,909 And so, this is just motivation for why 789 00:34:16,909 --> 00:34:19,040 these tables are so important. 790 00:34:19,040 --> 00:34:20,620 Here is the standard pipeline. 791 00:34:20,620 --> 00:34:23,179 Vijay talked about it briefly. 792 00:34:23,179 --> 00:34:26,449 And again, what we discovered is that, in most data 793 00:34:26,449 --> 00:34:30,320 analysis, which includes AI, this pipeline 794 00:34:30,320 --> 00:34:34,219 represents, pretty much, what most applications are doing. 795 00:34:34,219 --> 00:34:36,989 It's not that any application does all of that, 796 00:34:36,989 --> 00:34:38,780 but most of the steps in an application 797 00:34:38,780 --> 00:34:40,610 fall into this pipeline. 798 00:34:40,610 --> 00:34:43,550 Basically, you have raw data, you parse it, usually 799 00:34:43,550 --> 00:34:47,639 into some tabular format, you may ingest it into a database. 800 00:34:47,639 --> 00:34:51,690 You then query that database for a subset to do analysis, 801 00:34:51,690 --> 00:34:55,260 or you might scan the whole database, 802 00:34:55,260 --> 00:34:57,230 going over all the files, if you 803 00:34:57,230 --> 00:35:00,530 want to do some application that involves all the data. 804 00:35:00,530 --> 00:35:03,380 And then, you analyze it, and then you visualize it. 805 00:35:03,380 --> 00:35:06,800 And so, these are sort of the big steps in a pipeline 806 00:35:06,800 --> 00:35:07,850 that we see. 807 00:35:07,850 --> 00:35:09,970 And again, most applications will 808 00:35:09,970 --> 00:35:11,710 use maybe three of these steps. 809 00:35:11,710 --> 00:35:13,610 They're not using all of them, but it's 810 00:35:13,610 --> 00:35:17,270 a great framework for looking at any data analysis pipeline.
811 00:35:17,270 --> 00:35:20,300 It's easy to use, it's easy to understand, 812 00:35:20,300 --> 00:35:21,530 and it's easy to maintain. 813 00:35:21,530 --> 00:35:25,010 And you know, we like all of those features. 814 00:35:25,010 --> 00:35:27,957 And I'll get into those in more detail. 815 00:35:27,957 --> 00:35:30,290 And then finally, I've already talked significantly 816 00:35:30,290 --> 00:35:31,460 about data sharing. 817 00:35:31,460 --> 00:35:33,110 These are just some examples of data 818 00:35:33,110 --> 00:35:36,290 we've put up for sharing, the Moments in Time 819 00:35:36,290 --> 00:35:38,140 challenge, the Graph Challenge. 820 00:35:38,140 --> 00:35:40,970 And again, creating and sharing quality data 821 00:35:40,970 --> 00:35:45,560 really is not just a service to the community. 822 00:35:45,560 --> 00:35:48,620 It's principally a service to yourself. 823 00:35:48,620 --> 00:35:52,490 And in fact, we've worked with a number of applications 824 00:35:52,490 --> 00:35:55,550 and sponsors where we helped them do this, 825 00:35:55,550 --> 00:35:58,250 and they find the data that they made shareable is instantly 826 00:35:58,250 --> 00:36:00,350 the most valuable data in their organization 827 00:36:00,350 --> 00:36:02,600 because it's shareable within their organization too. 828 00:36:02,600 --> 00:36:04,808 The same things that make it shareable to the broader 829 00:36:04,808 --> 00:36:07,070 community make it shareable within your organization 830 00:36:07,070 --> 00:36:08,070 to other partners. 831 00:36:08,070 --> 00:36:09,570 So that's really important. 832 00:36:09,570 --> 00:36:11,920 And again, as I said before, co-designing 833 00:36:11,920 --> 00:36:15,020 the sharing and the purpose of the data 834 00:36:15,020 --> 00:36:17,660 is quite a unique thing to do. 835 00:36:17,660 --> 00:36:19,400 In order to do effective sharing, 836 00:36:19,400 --> 00:36:23,120 you can't just share data arbitrarily.
837 00:36:23,120 --> 00:36:25,070 That can be very difficult to do, 838 00:36:25,070 --> 00:36:27,500 but if you have a particular application in mind, 839 00:36:27,500 --> 00:36:29,390 sharing the data is quite simple. 840 00:36:29,390 --> 00:36:31,872 And I'll get into that later. 841 00:36:31,872 --> 00:36:33,830 So I'm just going to, then, sort of hammer home 842 00:36:33,830 --> 00:36:35,630 some of these points, here. 843 00:36:35,630 --> 00:36:37,965 So nothing really new, here, just sort 844 00:36:37,965 --> 00:36:40,880 of repeating what I've already said about these things 845 00:36:40,880 --> 00:36:43,370 but adding a little bit more detail to them. 846 00:36:43,370 --> 00:36:45,920 So why do we like tables? 847 00:36:45,920 --> 00:36:48,590 Well, we've been using tables to analyze data 848 00:36:48,590 --> 00:36:51,210 for thousands of years. 849 00:36:51,210 --> 00:36:54,740 So on the right is a 12th century manuscript. 850 00:36:54,740 --> 00:36:56,960 It's actually a table of pies. 851 00:36:56,960 --> 00:36:59,610 And if you can read Latin and Roman numerals, 852 00:36:59,610 --> 00:37:01,610 you can already tell, oh, there are 853 00:37:01,610 --> 00:37:03,590 some columns, and there are some rows. 854 00:37:03,590 --> 00:37:05,570 And those Roman numerals are numbers. 855 00:37:05,570 --> 00:37:09,410 Even 800 years later, we can still 856 00:37:09,410 --> 00:37:11,660 see that this is a data table. 857 00:37:11,660 --> 00:37:12,950 And it goes back further. 858 00:37:12,950 --> 00:37:17,120 You can find cuneiform tablets that are thousands of years old 859 00:37:17,120 --> 00:37:19,670 that are clearly written in a tabular form. 860 00:37:19,670 --> 00:37:23,180 So humans have been looking at data in a tabular form 861 00:37:23,180 --> 00:37:25,670 for thousands of years. 862 00:37:25,670 --> 00:37:29,700 Part of that is how our brain is actually wired.
863 00:37:29,700 --> 00:37:34,160 One, the 2D projection, since we live in 3D space, 864 00:37:34,160 --> 00:37:35,990 is the only projection that will show you 865 00:37:35,990 --> 00:37:37,965 all the data without occlusion. 866 00:37:37,965 --> 00:37:39,590 So if you want to look at all the data, 867 00:37:39,590 --> 00:37:41,390 it kind of has to be presented in 2D form. 868 00:37:41,390 --> 00:37:43,740 We have a 2D optical system. 869 00:37:43,740 --> 00:37:48,170 In addition, we have special hard-wired parts of our eyes 870 00:37:48,170 --> 00:37:52,550 that detect vertical and horizontal lines for, 871 00:37:52,550 --> 00:37:54,950 presumably, detecting the horizon or trees 872 00:37:54,950 --> 00:37:56,210 or who knows what. 873 00:37:56,210 --> 00:37:59,420 And that makes it very easy for our eyes 874 00:37:59,420 --> 00:38:02,130 to look at tabular data. 875 00:38:02,130 --> 00:38:04,640 And because of this, tabular data 876 00:38:04,640 --> 00:38:07,220 is compatible with almost every data 877 00:38:07,220 --> 00:38:09,800 analysis software on earth. 878 00:38:09,800 --> 00:38:11,490 If I make data analysis software and it 879 00:38:11,490 --> 00:38:13,940 can't read tables or write out tables, 880 00:38:13,940 --> 00:38:16,010 it's going to have a very small market share. 881 00:38:16,010 --> 00:38:20,800 And this is just an example of some of the software out there 882 00:38:20,800 --> 00:38:22,420 that is designed to use tables. 883 00:38:22,420 --> 00:38:26,410 So whether it be spreadsheets or databases or the neural network 884 00:38:26,410 --> 00:38:28,480 packages that we've already looked at, 885 00:38:28,480 --> 00:38:30,783 various languages and libraries, and even 886 00:38:30,783 --> 00:38:32,200 things that you wouldn't think are 887 00:38:32,200 --> 00:38:36,010 tabular like hierarchical file formats, like JSON and XML, 888 00:38:36,010 --> 00:38:40,130 can actually be made compatible with a tabular format. 
889 00:38:40,130 --> 00:38:42,040 So I'm just going to walk through these 890 00:38:42,040 --> 00:38:44,450 in a little more detail. 891 00:38:44,450 --> 00:38:47,200 So of course, our favorite is spreadsheets. 892 00:38:47,200 --> 00:38:50,242 Here's an example of a spreadsheet, on the left. 893 00:38:50,242 --> 00:38:51,700 And what's great about it is that I 894 00:38:51,700 --> 00:38:55,810 have four different types of data that are very, very 895 00:38:55,810 --> 00:38:57,460 different, and you can all look at them 896 00:38:57,460 --> 00:38:59,800 and sort of get a sense of what they are. 897 00:38:59,800 --> 00:39:03,490 And that just shows how flexible this tabular spreadsheet 898 00:39:03,490 --> 00:39:07,810 form is: we can handle really, really diverse types 899 00:39:07,810 --> 00:39:10,660 of data in a way that's intuitive to us. 900 00:39:10,660 --> 00:39:12,460 Some of the biggest pieces of software 901 00:39:12,460 --> 00:39:14,530 that implement this 902 00:39:14,530 --> 00:39:18,490 are Microsoft Excel, Google Sheets, Apple Numbers. 903 00:39:18,490 --> 00:39:22,480 It's used by about 100 million people every day. 904 00:39:22,480 --> 00:39:27,220 And so, this is the most popular tool for doing data analysis, 905 00:39:27,220 --> 00:39:29,160 and there's a good reason for it. 906 00:39:29,160 --> 00:39:32,150 And thinking of things in a tabular format 907 00:39:32,150 --> 00:39:34,390 so they can be pulled into this really makes 908 00:39:34,390 --> 00:39:36,350 communication a lot easier. 909 00:39:36,350 --> 00:39:39,100 And so again, just sort of hammering home this element 910 00:39:39,100 --> 00:39:40,150 of tables. 911 00:39:40,150 --> 00:39:42,400 Within that, there's specific formats 912 00:39:42,400 --> 00:39:44,660 that we have found to be very popular. 913 00:39:44,660 --> 00:39:50,560 So CSV and TSV are just non-proprietary, plain text 914 00:39:50,560 --> 00:39:52,350 ways of organizing data. 
915 00:39:52,350 --> 00:39:54,730 We highly encourage people to use them because then the data 916 00:39:54,730 --> 00:39:56,930 can be used by any tool. 917 00:39:56,930 --> 00:39:59,920 CSV just stands for comma separated values. 918 00:39:59,920 --> 00:40:03,940 It's usually filename.csv in lower case or filename.CSV 919 00:40:03,940 --> 00:40:05,050 in upper case. 920 00:40:05,050 --> 00:40:08,530 And all it means is that the columns-- 921 00:40:08,530 --> 00:40:12,700 So each row is terminated by a new line, and within each row 922 00:40:12,700 --> 00:40:14,920 the columns are separated by a comma. 923 00:40:14,920 --> 00:40:17,110 That's all that means. 924 00:40:17,110 --> 00:40:19,420 We tend to encourage people to use 925 00:40:19,420 --> 00:40:22,960 TSV, which stands for tab separated values, 926 00:40:22,960 --> 00:40:26,200 because we can replace the comma with a tab. 927 00:40:26,200 --> 00:40:29,110 And that allows us to write higher performance 928 00:40:29,110 --> 00:40:30,580 readers and writers. 929 00:40:30,580 --> 00:40:35,830 Because it is not uncommon to have a comma within a value, 930 00:40:35,830 --> 00:40:39,160 and so it makes it very difficult to write parsers 931 00:40:39,160 --> 00:40:41,170 that can do that fast. 932 00:40:41,170 --> 00:40:44,810 But it's OK to say to a person generating data, 933 00:40:44,810 --> 00:40:50,800 if you put a tab inside a value that's a foul on you. 934 00:40:50,800 --> 00:40:52,270 You shouldn't do that. 935 00:40:52,270 --> 00:40:54,820 And so that means I can write a program that 936 00:40:54,820 --> 00:40:57,160 can read this data really fast, and it can 937 00:40:57,160 --> 00:40:58,820 write this data really fast. 938 00:40:58,820 --> 00:41:01,960 And that's why, in the big data or AI space, 939 00:41:01,960 --> 00:41:05,430 people often tend to use TSV. 
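[The TSV convention described above, rows ending in newlines and columns separated by tabs, can be sketched in a few lines of Python; the file name and contents here are made up for illustration.]

```python
import csv

# A small table: first row is the column labels.
rows = [
    ["id", "name", "value"],
    ["r1", "alpha", "3.14"],
    ["r2", "beta", "2.72"],
]

# Write as TSV: each row ends with a newline, columns are
# separated by a tab instead of a comma.
with open("data.tsv", "w", newline="") as f:
    csv.writer(f, delimiter="\t").writerows(rows)

# Reading it back is essentially just a split on tabs per line,
# which is why TSV readers and writers can be so fast.
with open("data.tsv", newline="") as f:
    table = list(csv.reader(f, delimiter="\t"))
print(table[1])  # ['r1', 'alpha', '3.14']
```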
940 00:41:05,430 --> 00:41:07,840 And again, it's easily readable and writable 941 00:41:07,840 --> 00:41:09,640 by all those spreadsheet programs 942 00:41:09,640 --> 00:41:12,250 that I mentioned before. 943 00:41:12,250 --> 00:41:14,890 Databases play a very important role as well. 944 00:41:14,890 --> 00:41:18,750 There's several different types of databases that are called 945 00:41:18,750 --> 00:41:21,790 SQL, NoSQL, and NewSQL. 946 00:41:21,790 --> 00:41:23,620 They're good for different things. 947 00:41:23,620 --> 00:41:26,080 I won't dwell on that here, but there are 948 00:41:26,080 --> 00:41:27,820 different types of databases. 949 00:41:27,820 --> 00:41:30,670 All of them operate on tables. 950 00:41:30,670 --> 00:41:32,800 In fact, the central object in a database 951 00:41:32,800 --> 00:41:35,230 is literally called a database table. 952 00:41:35,230 --> 00:41:37,720 And so, databases play an important role. 953 00:41:37,720 --> 00:41:41,440 I won't go into detail here, but I just want to mention this. 954 00:41:41,440 --> 00:41:44,080 Likewise, you've already had some demonstrations 955 00:41:44,080 --> 00:41:45,700 of different languages and libraries. 956 00:41:45,700 --> 00:41:47,860 I think everything you saw was in Python. 957 00:41:47,860 --> 00:41:51,700 There's other languages-- Julia, MATLAB, Octave, R-- 958 00:41:51,700 --> 00:41:53,890 many other languages that play important roles 959 00:41:53,890 --> 00:41:55,120 in data analysis. 960 00:41:55,120 --> 00:41:56,920 We've talked about Jupyter Notebooks, which 961 00:41:56,920 --> 00:42:00,850 is a universal interface to these different languages 962 00:42:00,850 --> 00:42:03,940 and libraries to allow you to do data analysis in addition 963 00:42:03,940 --> 00:42:06,700 to the various machine learning tools. 
964 00:42:06,700 --> 00:42:08,297 We've talked a lot about TensorFlow, 965 00:42:08,297 --> 00:42:09,130 but there's others-- 966 00:42:09,130 --> 00:42:11,980 Theano, Caffe, NVIDIA DIGITS-- 967 00:42:11,980 --> 00:42:15,940 all are examples of different tools for doing AI. 968 00:42:15,940 --> 00:42:19,540 And again, they love to ingest tabular data, 969 00:42:19,540 --> 00:42:23,890 and they're very good at writing out tabular data, completely 970 00:42:23,890 --> 00:42:26,360 consistent with that. 971 00:42:26,360 --> 00:42:30,310 And then finally, I want to mention XML and JSON, 972 00:42:30,310 --> 00:42:33,730 which are very popular in object-oriented languages, 973 00:42:33,730 --> 00:42:38,650 like C++, Java, and Python. 974 00:42:38,650 --> 00:42:44,830 Those languages rely on data structures, which are basically 975 00:42:44,830 --> 00:42:48,130 just a set of fields, and then those fields 976 00:42:48,130 --> 00:42:51,190 can also be data structures, and this can continue on. 977 00:42:51,190 --> 00:42:53,890 These are called hierarchical data structures, 978 00:42:53,890 --> 00:42:56,540 the way these languages are organized. 979 00:42:56,540 --> 00:42:59,080 And so it's very natural for them to have formats 980 00:42:59,080 --> 00:43:02,560 that they can just say, write out this whole blob of data, 981 00:43:02,560 --> 00:43:05,350 and it will write it out into a hierarchical form called 982 00:43:05,350 --> 00:43:07,300 XML or JSON. 983 00:43:07,300 --> 00:43:10,990 However, it's not very human readable. 984 00:43:10,990 --> 00:43:14,680 And I just told you that all the data analysis tools 985 00:43:14,680 --> 00:43:18,040 want tables, which means that these formats aren't 986 00:43:18,040 --> 00:43:22,060 necessarily very good for doing data analysis on. 
987 00:43:22,060 --> 00:43:24,520 Fortunately, we've discovered some ways 988 00:43:24,520 --> 00:43:28,600 that we can basically convert these hierarchical formats 989 00:43:28,600 --> 00:43:31,540 into a giant, what we call, sparse table. 990 00:43:31,540 --> 00:43:35,260 And sparse just means a table where lots of the values 991 00:43:35,260 --> 00:43:38,470 are empty. 992 00:43:38,470 --> 00:43:40,280 The entries are empty. 993 00:43:40,280 --> 00:43:42,610 And so we can just use that space 994 00:43:42,610 --> 00:43:45,490 to represent this hierarchy, and all the tools 995 00:43:45,490 --> 00:43:51,180 we've just talked about can ingest this data pretty easily. 996 00:43:51,180 --> 00:43:57,720 Because tables are so popular, every single different approach 997 00:43:57,720 --> 00:43:58,980 or software package-- 998 00:43:58,980 --> 00:44:02,550 and we have a bunch of them listed in the first column, 999 00:44:02,550 --> 00:44:04,020 here-- 1000 00:44:04,020 --> 00:44:07,080 assigns different names to the standard ways 1001 00:44:07,080 --> 00:44:10,170 we refer to the elements of a table. 1002 00:44:10,170 --> 00:44:11,440 And there aren't that many. 1003 00:44:11,440 --> 00:44:13,107 So when you think about a table, there's 1004 00:44:13,107 --> 00:44:15,990 certain standard terms for dealing with what 1005 00:44:15,990 --> 00:44:17,640 do we call the table? 1006 00:44:17,640 --> 00:44:19,140 In Excel it's called a sheet. 1007 00:44:21,660 --> 00:44:24,570 How do we refer to a particular row? 1008 00:44:24,570 --> 00:44:26,550 What do we call a whole row? 1009 00:44:26,550 --> 00:44:29,160 How do we refer to a particular column? 1010 00:44:29,160 --> 00:44:31,260 What do we call a whole column? 1011 00:44:31,260 --> 00:44:33,030 What do we call an entry? 1012 00:44:33,030 --> 00:44:35,530 And what kind of math can we do on it? 
1013 00:44:35,530 --> 00:44:37,620 And so, whenever you're encountering 1014 00:44:37,620 --> 00:44:38,970 a new piece of software-- 1015 00:44:38,970 --> 00:44:41,160 We deal with it all the time, and a lot of times 1016 00:44:41,160 --> 00:44:43,620 you'll get software, maybe related to data 1017 00:44:43,620 --> 00:44:46,540 analysis, that seems really, really confusing. 1018 00:44:46,540 --> 00:44:48,960 It's probably working on tables, and so all you 1019 00:44:48,960 --> 00:44:53,320 need to learn is what they are calling these things. 1020 00:44:53,320 --> 00:44:55,800 And different software gives it different names, 1021 00:44:55,800 --> 00:44:58,830 just for arbitrary historical reasons. 1022 00:44:58,830 --> 00:45:03,060 And so, if you can figure out what it's calling these things, 1023 00:45:03,060 --> 00:45:04,630 then you'll discover the software 1024 00:45:04,630 --> 00:45:06,610 is a lot easier to understand. 1025 00:45:06,610 --> 00:45:10,740 So we definitely encourage you to just map whatever software 1026 00:45:10,740 --> 00:45:12,810 into these concepts, and all of a sudden life 1027 00:45:12,810 --> 00:45:15,970 becomes a lot easier. 1028 00:45:15,970 --> 00:45:20,070 And because tabular data is used so commonly, 1029 00:45:20,070 --> 00:45:23,070 whether it be in spreadsheets or analyzing graphs or databases 1030 00:45:23,070 --> 00:45:26,190 or matrices, it's a natural interchange format. 1031 00:45:26,190 --> 00:45:30,930 So by reading in tabular data and writing out tabular data, 1032 00:45:30,930 --> 00:45:34,800 you're preserving your ability to have that data go 1033 00:45:34,800 --> 00:45:36,390 into any application. 
1034 00:45:36,390 --> 00:45:38,250 Again, you have to be careful about people 1035 00:45:38,250 --> 00:45:40,740 who want to write proprietary formats 1036 00:45:40,740 --> 00:45:42,300 because they're basically 1037 00:45:42,300 --> 00:45:45,420 preventing you from working 1038 00:45:45,420 --> 00:45:47,310 with other tools and other communities. 1039 00:45:47,310 --> 00:45:51,035 And this is another reason why tabular data is so great. 1040 00:45:51,035 --> 00:45:53,160 All right, so moving on here, I wanted to just talk 1041 00:45:53,160 --> 00:45:55,150 about basic files and folders. 1042 00:45:55,150 --> 00:45:58,710 So hopefully, you know, tables are good. 1043 00:45:58,710 --> 00:46:00,630 That's sort of lesson one. 1044 00:46:00,630 --> 00:46:03,720 And we're going to talk about how well-chosen file and folder 1045 00:46:03,720 --> 00:46:05,980 names are also good. 1046 00:46:05,980 --> 00:46:06,480 Right? 1047 00:46:06,480 --> 00:46:09,280 Again, this sounds so obvious, but it's really the thing 1048 00:46:09,280 --> 00:46:11,280 that we've discovered, in working with thousands 1049 00:46:11,280 --> 00:46:13,910 of MIT researchers over the past decade, 1050 00:46:13,910 --> 00:46:17,370 that these are the fundamentals that help you out. 1051 00:46:17,370 --> 00:46:21,160 I've already mentioned this pipeline that we talked about. 1052 00:46:21,160 --> 00:46:25,300 It's a way we organize the data. 1053 00:46:25,300 --> 00:46:26,580 It's easy to build. 1054 00:46:26,580 --> 00:46:30,660 Every single computer you have has built-in tools 1055 00:46:30,660 --> 00:46:33,810 to support this, to just create folders and files. 1056 00:46:33,810 --> 00:46:35,895 That means it has no technical debt. 1057 00:46:35,895 --> 00:46:37,770 That is, you never have to worry about if you 1058 00:46:37,770 --> 00:46:40,080 have folders and files that in 10 years, 1059 00:46:40,080 --> 00:46:43,030 oh my god, folders and files are going to go away. 
1060 00:46:43,030 --> 00:46:45,230 They've been around for 50 years, 1061 00:46:45,230 --> 00:46:47,310 in pretty much their present form. 1062 00:46:47,310 --> 00:46:49,720 We don't anticipate them changing. 1063 00:46:49,720 --> 00:46:51,570 They're easier to understand. 1064 00:46:51,570 --> 00:46:53,760 I already talked about tabular file formats 1065 00:46:53,760 --> 00:46:55,710 and how useful they are. 1066 00:46:55,710 --> 00:46:57,480 And we will talk about a simple naming 1067 00:46:57,480 --> 00:46:59,800 scheme for your data files. 1068 00:46:59,800 --> 00:47:02,400 AI requires a lot of data, which means, 1069 00:47:02,400 --> 00:47:05,280 often, thousands or millions of files. 1070 00:47:05,280 --> 00:47:10,290 Having good schemes for naming them really helps you and helps 1071 00:47:10,290 --> 00:47:12,450 share the data amongst people in your team. 1072 00:47:12,450 --> 00:47:14,840 They can just look at the folders and files 1073 00:47:14,840 --> 00:47:17,110 and sort of understand what's going on. 1074 00:47:17,110 --> 00:47:19,735 And again, this folder structure is very easy to maintain, 1075 00:47:19,735 --> 00:47:21,360 doesn't require any special technology, 1076 00:47:21,360 --> 00:47:24,270 and anyone can look at it. 1077 00:47:24,270 --> 00:47:27,330 Getting into some more details about this table format, 1078 00:47:27,330 --> 00:47:30,715 some sort of real, sort of in the weeds types of things that 1079 00:47:30,715 --> 00:47:31,590 are really important. 1080 00:47:31,590 --> 00:47:33,340 I've already talked about how you want to avoid 1081 00:47:33,340 --> 00:47:35,880 proprietary formats if you can. 1082 00:47:35,880 --> 00:47:39,570 Using CSV and TSV is great. 1083 00:47:39,570 --> 00:47:41,340 They might be a little bulkier. 1084 00:47:41,340 --> 00:47:43,140 You can compress them. 1085 00:47:43,140 --> 00:47:45,330 No problem with doing that. 
1086 00:47:45,330 --> 00:47:47,450 Then you get pretty much all the benefits 1087 00:47:47,450 --> 00:47:50,910 of a proprietary format but in a way that's generally usable. 1088 00:47:53,850 --> 00:47:57,180 Within a TSV or CSV file, it's good practice 1089 00:47:57,180 --> 00:48:00,150 to make sure that each column label is unique. 1090 00:48:00,150 --> 00:48:01,920 So that's actually a big help. 1091 00:48:01,920 --> 00:48:03,420 You don't want to have, essentially, 1092 00:48:03,420 --> 00:48:07,500 the same column label or column name occur over and over again. 1093 00:48:07,500 --> 00:48:11,910 And likewise, it's helpful if the first column contains the row 1094 00:48:11,910 --> 00:48:15,330 labels, and those are each unique as well. 1095 00:48:15,330 --> 00:48:18,840 And just those two practices make it a lot easier 1096 00:48:18,840 --> 00:48:21,730 to read and organize the data. 1097 00:48:21,730 --> 00:48:25,770 Otherwise, you end up having to do various transformations. 1098 00:48:25,770 --> 00:48:28,630 If a lot of the entries are empty, i.e., 1099 00:48:28,630 --> 00:48:31,980 the table is very sparse, there's another format we use, 1100 00:48:31,980 --> 00:48:34,410 which is sometimes called triples format. 1101 00:48:34,410 --> 00:48:36,600 Basically, every single entry in a table 1102 00:48:36,600 --> 00:48:39,090 can be referred to by its row label, 1103 00:48:39,090 --> 00:48:40,560 its column label, and its value. 1104 00:48:40,560 --> 00:48:41,880 That forms a triple. 1105 00:48:41,880 --> 00:48:45,690 So you can just write another file, can be a TSV file, 1106 00:48:45,690 --> 00:48:47,320 that has three columns. 1107 00:48:47,320 --> 00:48:49,050 The first column is the row label, 1108 00:48:49,050 --> 00:48:52,250 the next column is the column label, 1109 00:48:52,250 --> 00:48:54,490 and the next column is the value. 
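[The triples format just described, one (row label, column label, value) line per non-empty entry, can be sketched like this; the labels and values are hypothetical.]

```python
# A sparse table kept as a dict: only the non-empty entries
# are stored, keyed by (row label, column label).
sparse = {
    ("r1", "colA"): 1,
    ("r3", "colB"): 5,
}

# Triples format: each entry becomes one three-column TSV line,
# row label, then column label, then value.
lines = ["\t".join(map(str, (r, c, v)))
         for (r, c), v in sorted(sparse.items())]
print(lines)  # ['r1\tcolA\t1', 'r3\tcolB\t5']
```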
1110 00:48:54,490 --> 00:48:57,988 And most software can read this in pretty nicely-- 1111 00:48:57,988 --> 00:48:59,530 it's a very convenient way of dealing 1112 00:48:59,530 --> 00:49:04,110 with very, very sparse data without losing any benefits. 1113 00:49:04,110 --> 00:49:07,000 So that's a very good thing. 1114 00:49:07,000 --> 00:49:10,430 In terms of actual file naming, as I said before, 1115 00:49:10,430 --> 00:49:12,890 you want to avoid having lots of tiny files. 1116 00:49:12,890 --> 00:49:15,640 Most file systems, most data systems, 1117 00:49:15,640 --> 00:49:19,540 do not work with small files really well. 1118 00:49:19,540 --> 00:49:21,670 You do want to compress. 1119 00:49:21,670 --> 00:49:24,370 Decompression doesn't take much time. 1120 00:49:24,370 --> 00:49:26,570 It's usually quite beneficial. 1121 00:49:26,570 --> 00:49:30,180 And so, it's better to have fewer larger files. 1122 00:49:30,180 --> 00:49:32,170 A sweet spot, for a long time, has 1123 00:49:32,170 --> 00:49:36,370 been file sizes in the one megabyte to 100 megabyte range. 1124 00:49:36,370 --> 00:49:39,070 This one megabyte is big enough so 1125 00:49:39,070 --> 00:49:42,460 that you can read it in and get high bandwidth, 1126 00:49:42,460 --> 00:49:46,060 get the data into your program quickly. 1127 00:49:46,060 --> 00:49:49,030 And up to 100 megabyte, once you start going beyond that, 1128 00:49:49,030 --> 00:49:52,430 you maybe can start exhausting the memory of your computer 1129 00:49:52,430 --> 00:49:54,290 or your processor. 1130 00:49:54,290 --> 00:49:56,740 So this one megabyte to 100 megabyte 1131 00:49:56,740 --> 00:50:00,220 range has been a sweet spot for quite a while. 1132 00:50:00,220 --> 00:50:05,410 You want to keep directories to less than 1,000 files. 1133 00:50:05,410 --> 00:50:07,180 That's pretty common nowadays. 
1134 00:50:07,180 --> 00:50:10,930 Creating directories with a really large number of files, 1135 00:50:10,930 --> 00:50:13,540 most systems don't really like that. 1136 00:50:13,540 --> 00:50:15,978 You'll find it's also hard to work with. 1137 00:50:15,978 --> 00:50:17,020 And so, that's something. 1138 00:50:17,020 --> 00:50:20,320 You want to keep your directories less than 1,000. 1139 00:50:20,320 --> 00:50:23,830 And so, you do that by using hierarchical directories. 1140 00:50:23,830 --> 00:50:25,690 And so, when it comes to naming files, 1141 00:50:25,690 --> 00:50:28,900 two things that tend to be pretty universal 1142 00:50:28,900 --> 00:50:33,020 are the source and time of the data. 1143 00:50:33,020 --> 00:50:35,230 Right? 1144 00:50:35,230 --> 00:50:38,200 Where you collected it and when you collected 1145 00:50:38,200 --> 00:50:41,350 it are pretty fundamental aspects that you 1146 00:50:41,350 --> 00:50:45,490 can rely on being a part of almost all data sets. 1147 00:50:45,490 --> 00:50:47,590 And so, this is a simple naming scheme 1148 00:50:47,590 --> 00:50:50,650 that we use that we find is very accessible. 1149 00:50:50,650 --> 00:50:53,560 We just create a hierarchical directory here. 1150 00:50:53,560 --> 00:50:55,990 It begins with source, and then we 1151 00:50:55,990 --> 00:50:59,890 have the years, the months, the days, the hours, the minutes, 1152 00:50:59,890 --> 00:51:02,365 however far you need to go to kind of hit 1153 00:51:02,365 --> 00:51:05,500 your sort of 1,000 files per folder window. 1154 00:51:05,500 --> 00:51:07,090 And then, we repeat all that name 1155 00:51:07,090 --> 00:51:09,730 within the file name itself, so that if the file ever 1156 00:51:09,730 --> 00:51:12,380 gets separated from its directory structure, 1157 00:51:12,380 --> 00:51:14,390 you still have that information. 
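[The source/time naming scheme described above can be sketched as a small helper; the source name, the depth of the hierarchy, and the file extension here are all illustrative choices, not a fixed standard.]

```python
from datetime import datetime

def data_path(source: str, t: datetime, ext: str = "tsv") -> str:
    """Build source/year/month/day/... and repeat the full
    source+timestamp in the file name itself, so a file that
    gets separated from its directory still carries its origin."""
    stamp = t.strftime("%Y/%m/%d")          # hierarchy levels
    flat = t.strftime("%Y%m%d-%H%M%S")      # repeated in the name
    return f"{source}/{stamp}/{source}-{flat}.{ext}"

p = data_path("sensor1", datetime(2020, 3, 14, 9, 26, 53))
print(p)  # sensor1/2020/03/14/sensor1-20200314-092653.tsv
```

Going down to hours or minutes in the hierarchy is the same idea, carried far enough that each folder stays under the roughly 1,000-file window mentioned above.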
1158 00:51:14,390 --> 00:51:16,930 And that's really important because it's easy for files 1159 00:51:16,930 --> 00:51:18,820 to get separated, and then, you know, 1160 00:51:18,820 --> 00:51:21,280 it's like a kid wandering around without his parents. 1161 00:51:21,280 --> 00:51:24,880 But if it's got a little label on it that says, 1162 00:51:24,880 --> 00:51:29,360 this is my address, then you can get the kid back home. 1163 00:51:29,360 --> 00:51:31,360 And likewise, sometimes you'll do 1164 00:51:31,360 --> 00:51:33,610 the time first and then source. 1165 00:51:33,610 --> 00:51:35,740 It really depends on the application. 1166 00:51:35,740 --> 00:51:36,670 They're both fine. 1167 00:51:36,670 --> 00:51:38,710 But again, this sounds really simple, 1168 00:51:38,710 --> 00:51:41,740 but it solves a lot of problems. 1169 00:51:41,740 --> 00:51:45,610 If you give data to people in this form, 1170 00:51:45,610 --> 00:51:47,200 they're going to know what it means. 1171 00:51:47,200 --> 00:51:47,700 Right? 1172 00:51:47,700 --> 00:51:49,510 It doesn't require a lot of explanations. 1173 00:51:49,510 --> 00:51:51,880 Like, dates are pretty obvious, source names 1174 00:51:51,880 --> 00:51:54,070 are pretty obvious, and then they can, basically, 1175 00:51:54,070 --> 00:51:56,500 work through it pretty nicely. 1176 00:51:56,500 --> 00:51:58,930 Databases and files can work together. 1177 00:51:58,930 --> 00:52:02,130 You'll have some people like, we do everything with files, 1178 00:52:02,130 --> 00:52:03,880 or we do everything with databases. 1179 00:52:03,880 --> 00:52:05,920 They do different things. 1180 00:52:05,920 --> 00:52:07,890 They serve different purposes. 1181 00:52:07,890 --> 00:52:11,200 So databases are good for quickly finding a small amount 1182 00:52:11,200 --> 00:52:13,840 of data in a big data set. 
1183 00:52:13,840 --> 00:52:16,660 So if I want to look up a little bit of data in a big data set, 1184 00:52:16,660 --> 00:52:18,700 that's what databases do well. 1185 00:52:18,700 --> 00:52:21,190 Likewise, if I want to read a ton of data, 1186 00:52:21,190 --> 00:52:24,280 like all of the data, and I want it analyzed, 1187 00:52:24,280 --> 00:52:26,140 that's what file systems do well. 1188 00:52:26,140 --> 00:52:28,870 So you, a lot of times, keep them both around. 1189 00:52:28,870 --> 00:52:31,780 You have databases for looking up little elements quickly, 1190 00:52:31,780 --> 00:52:34,690 and you have the file system 1191 00:52:34,690 --> 00:52:36,090 if you want to read a lot of data. 1192 00:52:36,090 --> 00:52:38,440 So they work together very harmoniously; they 1193 00:52:38,440 --> 00:52:40,130 serve different purposes. 1194 00:52:40,130 --> 00:52:43,990 And again, different databases are good for different things. 1195 00:52:43,990 --> 00:52:47,990 I won't belabor these, you can read them yourself. 1196 00:52:47,990 --> 00:52:50,200 But there are a variety of database technologies 1197 00:52:50,200 --> 00:52:52,370 out there, and they serve different purposes. 1198 00:52:52,370 --> 00:52:55,450 So let me talk about the folder structure in a little bit 1199 00:52:55,450 --> 00:52:56,070 more detail. 1200 00:52:56,070 --> 00:52:58,390 This is getting really like-- 1201 00:52:58,390 --> 00:53:00,080 It is really just this simple. 1202 00:53:00,080 --> 00:53:02,350 When we talk about our standard pipeline, 1203 00:53:02,350 --> 00:53:04,960 it's just a bunch of folders that we name, 1204 00:53:04,960 --> 00:53:07,150 like we show on the right. 
1205 00:53:07,150 --> 00:53:14,390 So we have an overall folder that we usually call pipeline, 1206 00:53:14,390 --> 00:53:21,880 and then we basically label each step with step zero, step one, 1207 00:53:21,880 --> 00:53:24,860 so people know which is the beginning and which is the end 1208 00:53:24,860 --> 00:53:26,110 and what's happening in there. 1209 00:53:26,110 --> 00:53:29,560 So we have step zero raw, step one parse, step two 1210 00:53:29,560 --> 00:53:32,530 ingest, step three query, step four analysis, 1211 00:53:32,530 --> 00:53:36,590 step five viz, if you're doing that. 1212 00:53:36,590 --> 00:53:42,340 And then, within each one of those, we'll have a folder, 1213 00:53:42,340 --> 00:53:44,710 we'll have a readme, which basically tells people 1214 00:53:44,710 --> 00:53:46,900 something about, OK, what's going on there. 1215 00:53:46,900 --> 00:53:48,790 Those readmes are really good. 1216 00:53:48,790 --> 00:53:53,230 Even if you never share this data with anybody else, 1217 00:53:53,230 --> 00:53:55,630 as you get on in life, you will often 1218 00:53:55,630 --> 00:53:58,360 be asked to revisit things, and just 1219 00:53:58,360 --> 00:54:01,300 having this information so that two years later, you 1220 00:54:01,300 --> 00:54:04,773 can go back and look at it, is really, really helpful. 1221 00:54:04,773 --> 00:54:06,190 Because otherwise, you'll discover 1222 00:54:06,190 --> 00:54:08,440 that you don't remember what it is you were doing. 1223 00:54:08,440 --> 00:54:09,880 So you write little readmes that 1224 00:54:09,880 --> 00:54:12,320 give you some information about what's going on. 1225 00:54:12,320 --> 00:54:13,330 And then, within each step, there's 1226 00:54:13,330 --> 00:54:14,770 usually a readme that tells you about what 1227 00:54:14,770 --> 00:54:15,820 the instructions are in here. 1228 00:54:15,820 --> 00:54:17,195 And then, we have a folder called 1229 00:54:17,195 --> 00:54:18,930 code and a folder called data. 
1230 00:54:18,930 --> 00:54:21,610 And basically, the code describes how 1231 00:54:21,610 --> 00:54:24,980 you put data into this folder. 1232 00:54:24,980 --> 00:54:28,870 So in the step zero raw, this is often the code, maybe, 1233 00:54:28,870 --> 00:54:31,750 for downloading the data from the internet 1234 00:54:31,750 --> 00:54:33,520 or how you collected your data. 1235 00:54:33,520 --> 00:54:37,420 That's the code that, basically, puts the data into that folder. 1236 00:54:37,420 --> 00:54:39,250 And then, with the next folder here, 1237 00:54:39,250 --> 00:54:42,910 parse, this is the code that takes this raw data 1238 00:54:42,910 --> 00:54:46,480 and parses it and puts it into this folder. 1239 00:54:46,480 --> 00:54:49,060 And then, for ingest, this will be 1240 00:54:49,060 --> 00:54:52,030 the code that takes the data from the parse form, 1241 00:54:52,030 --> 00:54:53,620 sticks it in a database, and these 1242 00:54:53,620 --> 00:54:55,930 might be the log files of the database entry. 1243 00:54:55,930 --> 00:54:59,920 So a lot of times, when you ingest data into a database, 1244 00:54:59,920 --> 00:55:03,400 you'll have log files that keep track of what happened. 1245 00:55:03,400 --> 00:55:05,860 And likewise, this continues on down. 1246 00:55:05,860 --> 00:55:07,120 So very simple. 1247 00:55:07,120 --> 00:55:12,550 And I would offer to you, if someone gave you this, 1248 00:55:12,550 --> 00:55:15,370 and you looked at it, without any documentation, 1249 00:55:15,370 --> 00:55:17,980 you'd probably have a pretty good idea of what's going on. 1250 00:55:17,980 --> 00:55:19,750 Oh, this is some kind of pipeline. 1251 00:55:19,750 --> 00:55:21,460 I wonder where it begins. 1252 00:55:21,460 --> 00:55:22,870 Probably step zero. 1253 00:55:22,870 --> 00:55:24,220 I wonder what that is. 1254 00:55:24,220 --> 00:55:26,450 It's probably raw data. 1255 00:55:26,450 --> 00:55:26,950 Right? 1256 00:55:26,950 --> 00:55:28,270 It's that simple. 
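[The pipeline layout described above can be sketched as a few lines that build the skeleton; the exact step names and the README file name are illustrative, not a fixed standard.]

```python
import os

# Numbered step folders so the order is self-evident, each with
# a readme plus code/ and data/ subfolders, as described above.
steps = ["step0_raw", "step1_parse", "step2_ingest",
         "step3_query", "step4_analysis", "step5_viz"]

for step in steps:
    for sub in ("code", "data"):
        os.makedirs(os.path.join("pipeline", step, sub), exist_ok=True)
    # The readme records how data gets into this step's data folder.
    with open(os.path.join("pipeline", step, "README.txt"), "w") as f:
        f.write(f"{step}: notes on what happens in this step\n")
```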
1257 00:55:28,270 --> 00:55:29,680 And this just makes it so easy. 1258 00:55:29,680 --> 00:55:31,600 The more you can get stuff to people 1259 00:55:31,600 --> 00:55:33,520 and they can just understand it without 1260 00:55:33,520 --> 00:55:37,150 any additional explanation, that's incredibly valuable. 1261 00:55:37,150 --> 00:55:40,838 And again, allows your teams to contribute quickly to projects. 1262 00:55:40,838 --> 00:55:42,880 And then, finally, we'll add a little thing here, 1263 00:55:42,880 --> 00:55:44,870 which is a data file list. 1264 00:55:44,870 --> 00:55:47,860 This is a list of the full names of all the data 1265 00:55:47,860 --> 00:55:51,610 files because, especially on our systems, 1266 00:55:51,610 --> 00:55:54,310 people are often processing lots of data, which 1267 00:55:54,310 --> 00:55:56,920 contains millions of files. 1268 00:55:56,920 --> 00:56:00,480 If you're going to be using many processors to do that, 1269 00:56:00,480 --> 00:56:03,950 it really punishes the file system 1270 00:56:03,950 --> 00:56:08,650 if you ask 1,000 processors to all traverse a data directory 1271 00:56:08,650 --> 00:56:11,080 and build up the file names so they can figure out which 1272 00:56:11,080 --> 00:56:13,180 files they should each process. 1273 00:56:13,180 --> 00:56:14,770 Really, that's actually a great way 1274 00:56:14,770 --> 00:56:17,080 to get kicked off a computer is to have 1275 00:56:17,080 --> 00:56:21,700 1,000 processes all doing ls -l on all 1276 00:56:21,700 --> 00:56:25,670 the folders in a new directory with millions of files. 1277 00:56:25,670 --> 00:56:28,580 Right, it will bring the file system to its knees. 
1278 00:56:28,580 --> 00:56:30,820 So if you just have this one file 1279 00:56:30,820 --> 00:56:32,290 that you generate once that lists 1280 00:56:32,290 --> 00:56:35,410 all the names of the files, then every single process 1281 00:56:35,410 --> 00:56:37,390 can just read in that file, pick 1282 00:56:37,390 --> 00:56:40,690 which files they want to read, and then do their work, 1283 00:56:40,690 --> 00:56:44,760 and then even know what the names to write the files are. 1284 00:56:44,760 --> 00:56:47,020 And that's, really, a very efficient way. 1285 00:56:47,020 --> 00:56:50,530 It dramatically speeds up your processing 1286 00:56:50,530 --> 00:56:53,938 and keeps the pressure on the file system at a minimum. 1287 00:56:53,938 --> 00:56:56,230 In addition, if you just want to look at what you have, 1288 00:56:56,230 --> 00:56:59,290 you don't have to type ls to list the directory, which 1289 00:56:59,290 --> 00:57:00,220 might take a while. 1290 00:57:00,220 --> 00:57:03,400 You can just look at this file and see what do I have. 1291 00:57:03,400 --> 00:57:03,900 Right? 1292 00:57:03,900 --> 00:57:06,540 It's very convenient to be able to know what the data sets are. 1293 00:57:06,540 --> 00:57:08,290 And then you can go in there, and you see, 1294 00:57:08,290 --> 00:57:12,430 oh, I see all these, you know, time stamps and sources. 1295 00:57:12,430 --> 00:57:13,390 I know what I have. 1296 00:57:13,390 --> 00:57:13,890 Right? 1297 00:57:13,890 --> 00:57:17,150 It makes it really easy for people to get through this. 1298 00:57:17,150 --> 00:57:20,410 So finally, I want to talk about using and sharing. 1299 00:57:20,410 --> 00:57:24,400 And so, again, co-design is using and sharing together. 
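[The data file list idea above can be sketched as follows: the listing is generated once, and each worker just reads it and takes its slice, so no worker ever traverses the directory tree. The file names and the round-robin split are hypothetical choices.]

```python
# The one shared listing, generated once (here faked in memory).
listing = [f"sensor1-202003{d:02d}.tsv" for d in range(1, 11)]

def my_files(listing, rank, nprocs):
    """Worker `rank` of `nprocs` takes every nprocs-th name
    from the shared listing -- no directory scan needed."""
    return listing[rank::nprocs]

# Worker 1 of 4 gets its share of the ten files.
print(my_files(listing, 1, 4))
```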
1300 00:57:24,400 --> 00:57:28,510 Most data that you will deal with starts out unusable 1301 00:57:28,510 --> 00:57:30,460 and unshareable, that is, the data 1302 00:57:30,460 --> 00:57:33,460 is not suited for the purpose that you wanted to use it for, 1303 00:57:33,460 --> 00:57:36,890 and you do not have permission to share it with anybody. 1304 00:57:36,890 --> 00:57:39,400 And so, if you understand the data and its purpose, 1305 00:57:39,400 --> 00:57:42,520 you can get through both of those problems. 1306 00:57:42,520 --> 00:57:47,480 So, really practical aspects, here, of data sharing. 1307 00:57:47,480 --> 00:57:50,290 We're going to talk about a few different roles that are 1308 00:57:50,290 --> 00:57:52,270 very common in data sharing. 1309 00:57:52,270 --> 00:57:55,490 The first is the keeper of the data. 1310 00:57:55,490 --> 00:57:58,120 This is the actual person who has the ability 1311 00:57:58,120 --> 00:58:00,250 to make a copy of the data. 1312 00:58:00,250 --> 00:58:02,710 You can talk to lots of people, but until you 1313 00:58:02,710 --> 00:58:06,610 find the keeper of the data, you haven't really gotten anywhere. 1314 00:58:06,610 --> 00:58:07,110 Right? 1315 00:58:07,110 --> 00:58:09,670 Because lots-- Oh, yes, we have data, and we can share it, 1316 00:58:09,670 --> 00:58:11,170 or we can give it to you, et cetera. 1317 00:58:11,170 --> 00:58:14,050 And there's someone who can eventually type a copy 1318 00:58:14,050 --> 00:58:16,090 command on the data. 1319 00:58:16,090 --> 00:58:17,562 You want to find that person. 1320 00:58:17,562 --> 00:58:19,520 That's the person you really need to work with. 1321 00:58:19,520 --> 00:58:22,030 They're the one person who can really, really help you 1322 00:58:22,030 --> 00:58:25,360 because they actually have hands-on access to the data. 1323 00:58:25,360 --> 00:58:27,400 So you want to find that person. 
1324 00:58:27,400 --> 00:58:31,300 Then, you want to get a sample or a copy of the data 1325 00:58:31,300 --> 00:58:35,350 to your team to assess. 1326 00:58:35,350 --> 00:58:39,870 It is always good to do good opsec, which 1327 00:58:39,870 --> 00:58:44,020 is good operational security, which means fewer accounts, 1328 00:58:44,020 --> 00:58:46,520 fewer people having access is better. 1329 00:58:46,520 --> 00:58:49,360 A lot of times people say, oh, here's the data. 1330 00:58:49,360 --> 00:58:54,220 Give me a list of names that you want to have access to it. 1331 00:58:54,220 --> 00:58:57,400 If you provide them one name, that's often pretty easy 1332 00:58:57,400 --> 00:58:59,340 to get approved. 1333 00:58:59,340 --> 00:59:03,850 If you provide 10 names, now questions will happen. 1334 00:59:03,850 --> 00:59:07,160 Why are we giving 10 people access to this data? 1335 00:59:07,160 --> 00:59:10,070 You don't really need 10 people to have access to the data. 1336 00:59:10,070 --> 00:59:14,200 You can just have one person get the data on behalf of the team. 1337 00:59:14,200 --> 00:59:16,090 That's relatively easy to approve. 1338 00:59:16,090 --> 00:59:19,090 You start giving a list of 20 names to have access to the data, 1339 00:59:19,090 --> 00:59:21,060 now there's all kinds of questions, 1340 00:59:21,060 --> 00:59:24,490 and that may just end the exercise right there. 1341 00:59:24,490 --> 00:59:28,300 And once the questions start happening, you'll discover, 1342 00:59:28,300 --> 00:59:31,270 well, maybe we've lost that one, and we need 1343 00:59:31,270 --> 00:59:33,430 to move on to someone else. 1344 00:59:33,430 --> 00:59:36,520 And guess what, that keeper, do they like you now 1345 00:59:36,520 --> 00:59:38,290 that they've raised all kinds of questions 1346 00:59:38,290 --> 00:59:39,998 because of something they thought was OK? 1347 00:59:39,998 --> 00:59:40,810 Well, now we're-- 1348 00:59:40,810 --> 00:59:43,190 No, you may no longer have a friend. 
1349 00:59:43,190 --> 00:59:43,690 Right? 1350 00:59:43,690 --> 00:59:46,120 So you've lost a friend, you've lost a source of data, 1351 00:59:46,120 --> 00:59:49,720 so minimizing the access to the data, 1352 00:59:49,720 --> 00:59:51,610 giving access just to the people you 1353 00:59:51,610 --> 00:59:55,240 need to make your next step is a really good practice. 1354 00:59:55,240 --> 00:59:56,800 I have been that keeper. 1355 00:59:56,800 --> 00:59:59,050 I have been the person who has to approve things, 1356 00:59:59,050 --> 01:00:01,720 and nothing makes me shut something down 1357 01:00:01,720 --> 01:00:04,630 quicker than when someone says, well, I want 20 people to have 1358 01:00:04,630 --> 01:00:05,130 access. 1359 01:00:05,130 --> 01:00:07,560 I'm like, well, no. 1360 01:00:07,560 --> 01:00:10,060 Zero works for me. 1361 01:00:10,060 --> 01:00:11,803 That's a number between zero and 20. 1362 01:00:11,803 --> 01:00:14,205 (LAUGHS) Right? 1363 01:00:14,205 --> 01:00:15,580 That's one of the numbers, there. 1364 01:00:15,580 --> 01:00:16,300 How about zero? 1365 01:00:16,300 --> 01:00:17,620 That works for me. 1366 01:00:17,620 --> 01:00:20,230 So then, once you get a copy of the data, 1367 01:00:20,230 --> 01:00:23,560 you then convert a sample of it to a tabular form, 1368 01:00:23,560 --> 01:00:27,190 you can identify the useful columns or features in that, 1369 01:00:27,190 --> 01:00:29,410 and then you can get rid of the rest. 1370 01:00:29,410 --> 01:00:31,570 That's a huge service to your community: 1371 01:00:31,570 --> 01:00:33,850 if you have a volume of data, and you know, like, 1372 01:00:33,850 --> 01:00:36,410 all of this stuff is not going to be useful, 1373 01:00:36,410 --> 01:00:38,350 now you're just giving people useful data. 1374 01:00:38,350 --> 01:00:39,880 That's really, really valuable. 
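[The convert-to-tabular-and-prune step just described can be sketched with the standard library. This is a hedged illustration only; the column names are invented, not from any real data set mentioned in the lecture.]

```python
import csv

def prune_columns(in_path, out_path, keep):
    """Copy a tabular sample, keeping only the useful columns
    and dropping everything else before the data is shared."""
    with open(in_path, newline="") as fin, open(out_path, "w", newline="") as fout:
        reader = csv.DictReader(fin)
        writer = csv.DictWriter(fout, fieldnames=keep)
        writer.writeheader()
        for row in reader:
            # Emit only the columns the application actually needs.
            writer.writerow({k: row[k] for k in keep})
```

[Run once on a small sample to agree on the useful columns with the keeper, then apply the same pruning to the full data set; recipients never see the dropped columns at all.]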
1375 01:00:39,880 --> 01:00:41,160 That's sort of a-- 1376 01:00:41,160 --> 01:00:43,360 And since you're talking to the keeper of the data, 1377 01:00:43,360 --> 01:00:45,100 they may have the most information 1378 01:00:45,100 --> 01:00:48,100 about what these different columns or features mean. 1379 01:00:48,100 --> 01:00:50,050 That's not a conversation that a lot of people 1380 01:00:50,050 --> 01:00:51,350 are going to get to have. 1381 01:00:51,350 --> 01:00:53,600 And so, if you get to help your whole team by saying, 1382 01:00:53,600 --> 01:00:55,990 oh, we know these columns mean this. 1383 01:00:55,990 --> 01:00:58,340 They're not relevant, let's get rid of them. 1384 01:00:58,340 --> 01:01:00,010 That's a huge service to your community. 1385 01:01:00,010 --> 01:01:01,030 It's less data. 1386 01:01:01,030 --> 01:01:03,880 It makes it easier to release. 1387 01:01:03,880 --> 01:01:08,230 Then, if there are data columns or features 1388 01:01:08,230 --> 01:01:10,270 that are sensitive in some way, there's 1389 01:01:10,270 --> 01:01:13,810 a variety of techniques you can use to minimize, anonymize, 1390 01:01:13,810 --> 01:01:17,110 possibly simulate, or use surrogate information to make 1391 01:01:17,110 --> 01:01:19,060 it so that you can share that data. 1392 01:01:19,060 --> 01:01:21,160 And there's lots of techniques, but those 1393 01:01:21,160 --> 01:01:23,960 are only applicable when you know your application. 1394 01:01:23,960 --> 01:01:26,500 So if I look at this data, and I see this one column that 1395 01:01:26,500 --> 01:01:29,950 might be sensitive but valuable for my application, 1396 01:01:29,950 --> 01:01:32,380 I can often come up with a minimization 1397 01:01:32,380 --> 01:01:34,240 scheme that will work for that. 
1398 01:01:34,240 --> 01:01:35,980 Because all data analysis is based 1399 01:01:35,980 --> 01:01:38,597 on tables, which is mathematically just matrices, 1400 01:01:38,597 --> 01:01:40,180 one of the most fundamental properties 1401 01:01:40,180 --> 01:01:42,610 of matrices that we really, really like is 1402 01:01:42,610 --> 01:01:44,620 that reordering the rows or the columns 1403 01:01:44,620 --> 01:01:46,120 doesn't change any of the math. 1404 01:01:46,120 --> 01:01:49,210 That's kind of, like, the big deal of matrices and tables. 1405 01:01:49,210 --> 01:01:51,292 You can reorder the rows, 1406 01:01:51,292 --> 01:01:53,500 reorder the columns, and everything should still just 1407 01:01:53,500 --> 01:01:56,080 work, which means if you relabel the columns 1408 01:01:56,080 --> 01:02:00,220 or relabel the rows, things should still just work, 1409 01:02:00,220 --> 01:02:04,320 and you'll be able to do the science you need. 1410 01:02:04,320 --> 01:02:07,690 So a lot of times, when you share data with researchers, 1411 01:02:07,690 --> 01:02:10,660 information that normally you would delete, 1412 01:02:10,660 --> 01:02:13,050 that you would think would be valuable for someone else-- 1413 01:02:13,050 --> 01:02:14,770 The researcher is like, I don't care. 1414 01:02:14,770 --> 01:02:15,270 Right? 1415 01:02:17,680 --> 01:02:19,780 As long as it doesn't change something fundamental 1416 01:02:19,780 --> 01:02:22,608 about the structure of the data, I can work with that. 1417 01:02:22,608 --> 01:02:24,400 And again, that's something you would never 1418 01:02:24,400 --> 01:02:26,980 know unless you have an end purpose in mind. 
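[The relabeling idea above can be made concrete with a small sketch: deterministically hash each sensitive label so equal values map to equal tokens. Joins, grouping, and counting then give the same answers on the relabeled data as on the raw data. The salt and sample rows below are illustrative assumptions, not from the lecture.]

```python
import hashlib

SALT = b"per-project secret"  # assumption: kept private by the data owner

def pseudonymize(value: str) -> str:
    """Deterministic relabeling: equal inputs always get equal tokens,
    so the structure of the table is preserved under the relabeling."""
    return hashlib.sha256(SALT + value.encode()).hexdigest()[:12]

def totals(pairs):
    """Sum counts per label; the multiset of totals is what analysis sees,
    and it is unchanged when the labels are consistently relabeled."""
    out = {}
    for key, n in pairs:
        out[key] = out.get(key, 0) + n
    return sorted(out.values())
```

[For example, grouping `[("alice", 10), ("bob", 5), ("alice", 7)]` by name gives the same totals whether you group by the raw names or by their pseudonyms, which is exactly why researchers often don't need the original labels.]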
1419 01:02:26,980 --> 01:02:30,580 So then, you want to obtain preliminary approval to take-- 1420 01:02:30,580 --> 01:02:32,633 I have a data set, I've worked with the keeper, 1421 01:02:32,633 --> 01:02:35,050 is it OK for me to share this with a broader set of people 1422 01:02:35,050 --> 01:02:37,180 to make sure that they like it too? 1423 01:02:37,180 --> 01:02:38,460 That's usually easy to do. 1424 01:02:38,460 --> 01:02:40,360 It's just a small sample. 1425 01:02:40,360 --> 01:02:42,280 And then, you can test with your users. 1426 01:02:42,280 --> 01:02:45,190 You repeat this process until you, basically, have data. 1427 01:02:45,190 --> 01:02:47,145 It's like, yeah, this is what we want, 1428 01:02:47,145 --> 01:02:49,480 this is what we want to share, and then, you can sort of 1429 01:02:49,480 --> 01:02:51,460 go through the process of how do we really get 1430 01:02:51,460 --> 01:02:53,560 an agreement in place to do that. 1431 01:02:53,560 --> 01:02:55,150 You also, then, create the folder 1432 01:02:55,150 --> 01:02:58,690 naming structure for scaling out to more data files. 1433 01:02:58,690 --> 01:03:00,880 And then finally, what's great is 1434 01:03:00,880 --> 01:03:04,510 if you can then automate this process at the data owner's 1435 01:03:04,510 --> 01:03:05,620 site. 1436 01:03:05,620 --> 01:03:07,240 It's tremendously valuable. 1437 01:03:07,240 --> 01:03:10,540 Then you have just gifted them a data architecture 1438 01:03:10,540 --> 01:03:13,030 that makes their data AI ready. 1439 01:03:13,030 --> 01:03:15,550 And then, moving forward, they can now 1440 01:03:15,550 --> 01:03:19,640 share with you, within their own organization, or with others. 1441 01:03:19,640 --> 01:03:20,140 Right? 1442 01:03:20,140 --> 01:03:23,020 So this is incredibly valuable. 1443 01:03:23,020 --> 01:03:25,660 Right, we said data architecture and wrangling is often 1444 01:03:25,660 --> 01:03:27,040 80% of the effort. 
1445 01:03:27,040 --> 01:03:29,530 You do this, you've now eliminated potentially 1446 01:03:29,530 --> 01:03:31,780 80% of the effort of anyone else who 1447 01:03:31,780 --> 01:03:34,370 ever wants to do data analysis or AI on this data. 1448 01:03:34,370 --> 01:03:34,870 Right? 1449 01:03:34,870 --> 01:03:39,160 This is a phenomenally huge win for anyone 1450 01:03:39,160 --> 01:03:42,400 that you've partnered with: to give them a very simple data 1451 01:03:42,400 --> 01:03:44,680 architecture they can maintain going forward. 1452 01:03:44,680 --> 01:03:49,670 It's a huge potential savings to them. 1453 01:03:49,670 --> 01:03:51,670 All right, so another role we want to talk about 1454 01:03:51,670 --> 01:03:55,600 is the data owner subject matter expert, or SME. 1455 01:03:55,600 --> 01:03:57,790 So we've talked about the keeper of the data. 1456 01:03:57,790 --> 01:03:59,500 There's often a subject matter expert 1457 01:03:59,500 --> 01:04:02,410 associated with the data that is an advocate. 1458 01:04:02,410 --> 01:04:04,900 The keeper is often like, hey, I maintain the data. 1459 01:04:04,900 --> 01:04:08,170 I do what these subject matter experts tell me to do. 1460 01:04:08,170 --> 01:04:09,560 But there's someone out there who 1461 01:04:09,560 --> 01:04:12,760 wants to help you get the data because they see it's 1462 01:04:12,760 --> 01:04:14,950 going to be valuable to you. 1463 01:04:14,950 --> 01:04:20,020 And all these AI data products are generally best made 1464 01:04:20,020 --> 01:04:21,550 by people with AI knowledge. 1465 01:04:21,550 --> 01:04:24,370 So we are highly motivated to engage 1466 01:04:24,370 --> 01:04:26,140 with these subject matter experts, 1467 01:04:26,140 --> 01:04:28,600 pull them into the AI community so they 1468 01:04:28,600 --> 01:04:30,430 can see the best practices. 
1469 01:04:30,430 --> 01:04:32,740 That will allow them to understand 1470 01:04:32,740 --> 01:04:36,190 which data they have, how it might be useful to AI, 1471 01:04:36,190 --> 01:04:39,130 and how to prepare it so that it's better. 1472 01:04:39,130 --> 01:04:42,100 So whenever we get engaged with subject matter experts 1473 01:04:42,100 --> 01:04:44,560 at other communities, at data owners, 1474 01:04:44,560 --> 01:04:46,870 we really want to pull them into the community, 1475 01:04:46,870 --> 01:04:48,430 have them come to our environments 1476 01:04:48,430 --> 01:04:52,260 so they can see what's going on and help us identify new data 1477 01:04:52,260 --> 01:04:54,910 sets and help make those data sets 1478 01:04:54,910 --> 01:04:58,870 appropriate for our consumption. 1479 01:04:58,870 --> 01:05:01,630 So one of the big things you'll run into 1480 01:05:01,630 --> 01:05:05,070 is that there are concerns with sharing data. 1481 01:05:05,070 --> 01:05:07,000 That is, there's a lot of confusion 1482 01:05:07,000 --> 01:05:11,650 about the liability of sharing data with researchers. 1483 01:05:11,650 --> 01:05:15,250 And data owners often have to think about the lowest 1484 01:05:15,250 --> 01:05:16,510 common denominator. 1485 01:05:16,510 --> 01:05:18,250 If it's a company, they have to think 1486 01:05:18,250 --> 01:05:21,040 about US and EU and other requirements, 1487 01:05:21,040 --> 01:05:22,420 all kinds of different frameworks 1488 01:05:22,420 --> 01:05:24,550 for thinking about that. 1489 01:05:24,550 --> 01:05:27,610 The good news is there are standard practices that 1490 01:05:27,610 --> 01:05:30,700 meet these requirements. 1491 01:05:30,700 --> 01:05:33,400 And I'm going to list what these standard practices are. 
1492 01:05:33,400 --> 01:05:35,380 We have worked with many folks who 1493 01:05:35,380 --> 01:05:38,380 have tried to share data successfully, 1494 01:05:38,380 --> 01:05:42,100 and these are the properties of successful data sharing 1495 01:05:42,100 --> 01:05:43,180 activities. 1496 01:05:43,180 --> 01:05:46,450 And the activities that don't do these things often 1497 01:05:46,450 --> 01:05:47,710 are unsuccessful. 1498 01:05:47,710 --> 01:05:49,790 We didn't invent any of this. 1499 01:05:49,790 --> 01:05:53,560 This is just our observations from working with many people 1500 01:05:53,560 --> 01:05:55,300 who've tried to share data. 1501 01:05:55,300 --> 01:05:59,080 One, you want data available in curated repositories. 1502 01:05:59,080 --> 01:06:02,590 Just taking data and posting it somewhere 1503 01:06:02,590 --> 01:06:08,890 without proper curation really does no good for anyone. 1504 01:06:08,890 --> 01:06:11,230 You want to use standard anonymization 1505 01:06:11,230 --> 01:06:12,560 methods where needed. 1506 01:06:12,560 --> 01:06:14,890 Hashing, sampling, simulation, I've already 1507 01:06:14,890 --> 01:06:19,990 talked about those, but those do work, particularly when they 1508 01:06:19,990 --> 01:06:23,470 are coupled with access that requires registration 1509 01:06:23,470 --> 01:06:25,690 with a repository and legitimate need. 1510 01:06:25,690 --> 01:06:29,410 So there's a data agreement associated 1511 01:06:29,410 --> 01:06:34,150 with getting the data in some way, shape, or form. 1512 01:06:34,150 --> 01:06:37,840 And those two together really, really work well. 1513 01:06:37,840 --> 01:06:40,685 And you also want to focus on getting 1514 01:06:40,685 --> 01:06:42,310 the data to the people who are actually 1515 01:06:42,310 --> 01:06:43,930 going to add value to it. 
1516 01:06:43,930 --> 01:06:46,390 So one of the misconceptions is, oh, we'll just take data, 1517 01:06:46,390 --> 01:06:48,820 and we'll make it available to all 10 million souls 1518 01:06:48,820 --> 01:06:52,390 on the planet earth, when, really, less than 100 1519 01:06:52,390 --> 01:06:56,140 could ever actually add value to it. 1520 01:06:56,140 --> 01:06:59,500 So why are we going through all the trouble and the issues 1521 01:06:59,500 --> 01:07:02,290 associated with making data available to anyone, when 1522 01:07:02,290 --> 01:07:05,860 really we only want to get that data to people with the skills 1523 01:07:05,860 --> 01:07:09,490 and legitimate research needs to make use of it? 1524 01:07:09,490 --> 01:07:12,820 And so, by setting up curated repositories, 1525 01:07:12,820 --> 01:07:14,300 people then apply-- 1526 01:07:14,300 --> 01:07:15,880 I want to get access to data-- 1527 01:07:15,880 --> 01:07:17,920 they talk about what their need is. 1528 01:07:17,920 --> 01:07:19,330 That sets, sort of, a swim lane. 1529 01:07:19,330 --> 01:07:21,495 You can't just do anything you want with the data. 1530 01:07:21,495 --> 01:07:23,620 When you say, I'm going to use it for this purpose, 1531 01:07:23,620 --> 01:07:25,507 you have to stick to those guidelines. 1532 01:07:25,507 --> 01:07:28,090 If you want to change them, you have to go back and say, look, 1533 01:07:28,090 --> 01:07:29,980 I want to do something different. 1534 01:07:29,980 --> 01:07:31,390 If you're a legitimate researcher 1535 01:07:31,390 --> 01:07:34,540 with legitimate research needs, that's usually pretty easy 1536 01:07:34,540 --> 01:07:36,230 to prove. 1537 01:07:36,230 --> 01:07:39,950 And then, recipients agree not to repost the whole data set 1538 01:07:39,950 --> 01:07:41,410 or to deanonymize it. 1539 01:07:41,410 --> 01:07:43,840 This is what makes anonymization work. 
1540 01:07:43,840 --> 01:07:46,120 Not that the anonymization or the encryption 1541 01:07:46,120 --> 01:07:48,130 is so strong it's unbreakable, it's 1542 01:07:48,130 --> 01:07:50,980 that people agree not to reverse it. 1543 01:07:50,980 --> 01:07:53,740 And the good thing about working with legitimate researchers 1544 01:07:53,740 --> 01:07:57,640 is they're highly motivated not to break their agreements. 1545 01:07:57,640 --> 01:08:00,220 If you're at MIT or any of these universities 1546 01:08:00,220 --> 01:08:02,260 and you break your data sharing agreements, 1547 01:08:02,260 --> 01:08:05,650 that is professionally very detrimental to you 1548 01:08:05,650 --> 01:08:09,430 because it is very detrimental to the organization. 1549 01:08:09,430 --> 01:08:11,770 No research or organization ever wants 1550 01:08:11,770 --> 01:08:16,720 to have a reputation that it does not respect its data usage 1551 01:08:16,720 --> 01:08:17,560 agreements. 1552 01:08:17,560 --> 01:08:20,750 That will put that research organization out of business. 1553 01:08:20,750 --> 01:08:23,560 So there's a lot of incentive for legitimate researchers 1554 01:08:23,560 --> 01:08:25,810 to honor these agreements, which may not 1555 01:08:25,810 --> 01:08:29,500 be the same for just general people who are working 1556 01:08:29,500 --> 01:08:31,880 in other types of environments. 1557 01:08:31,880 --> 01:08:35,410 So they can publish their analysis and data examples 1558 01:08:35,410 --> 01:08:38,279 as necessary, they agree to cite the repository 1559 01:08:38,279 --> 01:08:40,609 and provide the publications back, 1560 01:08:40,609 --> 01:08:43,740 and the repository can curate enriched products. 1561 01:08:43,740 --> 01:08:47,950 So if a researcher discovers some derivative product-- 1562 01:08:47,950 --> 01:08:50,779 Look, I can apply this to all the data. 1563 01:08:50,779 --> 01:08:54,609 I'd like to have that be posted somewhere. 
1564 01:08:54,609 --> 01:08:57,520 They can often use the original repository to do that. 1565 01:08:57,520 --> 01:09:00,760 And again, we encourage everyone to encourage people 1566 01:09:00,760 --> 01:09:02,850 to follow these guidelines. 1567 01:09:02,850 --> 01:09:05,510 They're very beneficial. 1568 01:09:05,510 --> 01:09:11,260 So I've talked about the data keeper. 1569 01:09:11,260 --> 01:09:14,170 I've talked about the subject matter expert. 1570 01:09:14,170 --> 01:09:16,930 Now, I'm going to talk about a final critical role, which 1571 01:09:16,930 --> 01:09:21,040 is the ISO, the information security officer. 1572 01:09:21,040 --> 01:09:25,600 In almost every single organization that has data, 1573 01:09:25,600 --> 01:09:27,670 there is an information security officer 1574 01:09:27,670 --> 01:09:31,510 that needs to sign off on making the data available to anyone 1575 01:09:31,510 --> 01:09:32,680 else. 1576 01:09:32,680 --> 01:09:36,399 They're the person whose neck is on the line 1577 01:09:36,399 --> 01:09:40,840 and is saying that it's OK to share this data, 1578 01:09:40,840 --> 01:09:44,950 that the benefits outweigh the potential risks. 1579 01:09:44,950 --> 01:09:49,537 And subject matter experts need to communicate 1580 01:09:49,537 --> 01:09:51,120 with the information security officer. 1581 01:09:51,120 --> 01:09:53,109 They're often the ones advocating for this data 1582 01:09:53,109 --> 01:09:57,640 to be released, but they don't know the language of information 1583 01:09:57,640 --> 01:09:59,000 security officers. 1584 01:09:59,000 --> 01:10:01,430 And this causes huge problems. 1585 01:10:01,430 --> 01:10:05,350 And, I would say, 90% of data sharing efforts 1586 01:10:05,350 --> 01:10:11,440 die because of miscommunication between subject matter experts 1587 01:10:11,440 --> 01:10:13,750 and information security officers. 
1588 01:10:13,750 --> 01:10:16,720 Because the information security officer fundamentally 1589 01:10:16,720 --> 01:10:19,240 needs to believe that the subject matter 1590 01:10:19,240 --> 01:10:22,870 expert understands the risks, understands basic security 1591 01:10:22,870 --> 01:10:26,740 concepts in order for them to trust them with their signature 1592 01:10:26,740 --> 01:10:28,760 to get it released. 1593 01:10:28,760 --> 01:10:31,630 And so, typically, what happens is 1594 01:10:31,630 --> 01:10:37,210 that an ISO needs very basic information from the subject 1595 01:10:37,210 --> 01:10:39,190 matter expert in order to approve 1596 01:10:39,190 --> 01:10:41,330 the release of information. 1597 01:10:41,330 --> 01:10:42,530 What's the project? 1598 01:10:42,530 --> 01:10:43,520 What's the need? 1599 01:10:43,520 --> 01:10:45,590 Where is the data going to be? 1600 01:10:45,590 --> 01:10:48,730 Who's going to be using it for how long? 1601 01:10:48,730 --> 01:10:55,150 They are expecting one sentence answers to each of these. 1602 01:10:55,150 --> 01:10:57,190 A subject matter expert typically 1603 01:10:57,190 --> 01:11:01,150 replies with an entire research proposal. 1604 01:11:01,150 --> 01:11:04,090 And so, if I'm an ISO and I'm trying 1605 01:11:04,090 --> 01:11:06,760 to assess whether a subject matter expert knows anything 1606 01:11:06,760 --> 01:11:12,070 about security, and I gave them really easy softball questions, 1607 01:11:12,070 --> 01:11:13,780 and they come back with something 1608 01:11:13,780 --> 01:11:17,350 that clearly took them way more time and effort 1609 01:11:17,350 --> 01:11:19,810 and answers none of my questions, what's 1610 01:11:19,810 --> 01:11:23,200 my opinion going to be of that subject matter expert 1611 01:11:23,200 --> 01:11:25,720 and how much I should trust them? 1612 01:11:25,720 --> 01:11:27,065 It's going to be zero. 
1613 01:11:27,065 --> 01:11:31,360 And from that moment onward, I'll be polite but, 1614 01:11:31,360 --> 01:11:33,340 you know, how about zero data? 1615 01:11:33,340 --> 01:11:34,860 That works for me. 1616 01:11:34,860 --> 01:11:36,430 (LAUGHS) Right? 1617 01:11:36,430 --> 01:11:38,030 That's a good number, right? 1618 01:11:38,030 --> 01:11:40,763 So this is really important. 1619 01:11:40,763 --> 01:11:42,430 And so, if there's one thing we can do-- 1620 01:11:42,430 --> 01:11:44,380 Because very few of you are ISOs. 1621 01:11:44,380 --> 01:11:46,780 And something that my team gets involved in a lot 1622 01:11:46,780 --> 01:11:50,380 is to intercept this conversation before it goes 1623 01:11:50,380 --> 01:11:55,630 bad, is we can help you work with ISOs, 1624 01:11:55,630 --> 01:11:57,640 work with people who have to release the data, 1625 01:11:57,640 --> 01:11:59,840 and help answer these questions in a proper way. 1626 01:11:59,840 --> 01:12:02,720 And so, I'm going to go through a few examples, here. 1627 01:12:02,720 --> 01:12:06,490 So, a really common question is, what is the data 1628 01:12:06,490 --> 01:12:07,750 you're seeking to share? 1629 01:12:07,750 --> 01:12:09,430 So this is a very common question 1630 01:12:09,430 --> 01:12:10,810 to get back from an ISO. 1631 01:12:10,810 --> 01:12:13,373 Describe the data to be shared, focusing on its risk 1632 01:12:13,373 --> 01:12:15,790 to the organization if it were to be accidentally released 1633 01:12:15,790 --> 01:12:18,400 to the public or otherwise misused. 1634 01:12:18,400 --> 01:12:20,396 Very common question. 1635 01:12:20,396 --> 01:12:23,380 And so, here's an example answer. 1636 01:12:23,380 --> 01:12:25,060 Very short answer. 1637 01:12:25,060 --> 01:12:27,100 The data was collected on these dates 1638 01:12:27,100 --> 01:12:29,543 at this location in accordance with our mission. 
1639 01:12:29,543 --> 01:12:31,210 The risk has been assessed and addressed 1640 01:12:31,210 --> 01:12:33,310 by an appropriate combination of excision, 1641 01:12:33,310 --> 01:12:35,380 anonymization, and/or agreements. 1642 01:12:35,380 --> 01:12:37,570 The release to appropriate legitimate researchers 1643 01:12:37,570 --> 01:12:40,870 will further our mission and is endorsed by leadership. 1644 01:12:40,870 --> 01:12:43,960 Very, very, very short answer, but we've 1645 01:12:43,960 --> 01:12:47,120 covered a lot of ground in that short answer. 1646 01:12:47,120 --> 01:12:48,230 And I will explain. 1647 01:12:48,230 --> 01:12:52,480 Sentence one establishes the identity, finite scope, 1648 01:12:52,480 --> 01:12:55,840 and proper collection of the data in one sentence. 1649 01:12:55,840 --> 01:12:58,300 Sentence two establishes that risk was assessed 1650 01:12:58,300 --> 01:13:00,040 and that mitigations were taken. 1651 01:13:00,040 --> 01:13:02,920 Sentence three establishes the finite scope of the recipients, 1652 01:13:02,920 --> 01:13:06,200 an appropriate reason for release, and mission approval. 1653 01:13:06,200 --> 01:13:09,220 We've covered-- we've answered nine questions in three 1654 01:13:09,220 --> 01:13:12,400 sentences at the level of detail an ISO wants. 1655 01:13:12,400 --> 01:13:15,850 An ISO can always ask you more questions for more detail, 1656 01:13:15,850 --> 01:13:18,950 and they're usually looking for another one sentence answer. 1657 01:13:18,950 --> 01:13:21,220 And you build these up to be three, four, 1658 01:13:21,220 --> 01:13:23,290 five sentence things that cover broadly 1659 01:13:23,290 --> 01:13:24,880 what you're wanting to do. 1660 01:13:24,880 --> 01:13:27,220 And you see here, you don't want to do anything 1661 01:13:27,220 --> 01:13:30,095 that requires the ISO to have detailed technical knowledge 1662 01:13:30,095 --> 01:13:30,970 of what you're doing. 
1663 01:13:30,970 --> 01:13:34,870 They're looking for overall scope, overall limiters 1664 01:13:34,870 --> 01:13:37,000 on what you're doing so that it is somehow 1665 01:13:37,000 --> 01:13:41,050 contained to a reasonable risk. 1666 01:13:41,050 --> 01:13:44,740 Another question, here, that's very common is where, to whom 1667 01:13:44,740 --> 01:13:45,700 is the data going? 1668 01:13:45,700 --> 01:13:48,310 So please describe the intended recipients. 1669 01:13:48,310 --> 01:13:51,070 So an example answer is the data will be shared with researchers 1670 01:13:51,070 --> 01:13:53,480 at a specific set of institutions, 1671 01:13:53,480 --> 01:13:55,630 that it will be processed on those institutions' 1672 01:13:55,630 --> 01:13:58,810 own systems, meeting their institutions' security policies, 1673 01:13:58,810 --> 01:14:01,060 which include password-controlled access, 1674 01:14:01,060 --> 01:14:03,355 regular application of system updates, encryption 1675 01:14:03,355 --> 01:14:05,440 of mobile devices, such as laptops. 1676 01:14:05,440 --> 01:14:08,530 All provided access to data will be limited to personnel working 1677 01:14:08,530 --> 01:14:10,120 as part of this effort. 1678 01:14:10,120 --> 01:14:13,180 Again, we've covered a lot of ground very simply. 1679 01:14:13,180 --> 01:14:15,430 We didn't really go into too many details 1680 01:14:15,430 --> 01:14:18,880 that are specific to this particular case. 1681 01:14:18,880 --> 01:14:22,330 And then, a final one is what controls are there 1682 01:14:22,330 --> 01:14:25,180 on further release, either policy or legal? 
1683 01:14:25,180 --> 01:14:28,094 So an example: acceptable use guidelines 1684 01:14:28,094 --> 01:14:30,302 that prohibit attempting to deanonymize the data will 1685 01:14:30,302 --> 01:14:32,344 be provided to all personnel working on the data, 1686 01:14:32,344 --> 01:14:34,302 publication guidelines have been agreed to that 1687 01:14:34,302 --> 01:14:36,810 allow for high-level statistical findings to be published, 1688 01:14:36,810 --> 01:14:39,760 but prohibit including any individual data records. 1689 01:14:39,760 --> 01:14:41,250 A set of notional records has been 1690 01:14:41,250 --> 01:14:43,750 provided that can be published as an example of the data format 1691 01:14:43,750 --> 01:14:45,565 but is not part of the actual data set. 1692 01:14:45,565 --> 01:14:47,440 The research agreement requires that all data 1693 01:14:47,440 --> 01:14:48,700 be deleted at the end of the engagement 1694 01:14:48,700 --> 01:14:50,570 except those items retained for publication. 1695 01:14:50,570 --> 01:14:52,070 I'm not saying you have to use this. 1696 01:14:52,070 --> 01:14:53,487 But again, this is another example 1697 01:14:53,487 --> 01:14:55,660 of what ISOs are expecting. 1698 01:14:55,660 --> 01:14:59,245 You give them language like this, their confidence swells. 1699 01:14:59,245 --> 01:15:01,400 They're like, ah, I'm dealing with someone 1700 01:15:01,400 --> 01:15:04,890 that understands the risks I'm taking on their behalf. 1701 01:15:04,890 --> 01:15:07,430 And we can move forward in a very productive way. 1702 01:15:07,430 --> 01:15:09,870 And if you ever have questions about how to do this, 1703 01:15:09,870 --> 01:15:12,050 feel free to talk to my team or other people 1704 01:15:12,050 --> 01:15:14,750 or other ISOs who've had to deal with this. 
1705 01:15:14,750 --> 01:15:17,060 We actually are working with a lot 1706 01:15:17,060 --> 01:15:19,730 of people on this language, a lot of people in the government, 1707 01:15:19,730 --> 01:15:22,070 to sort of get these best practices out there. 1708 01:15:22,070 --> 01:15:24,710 We have socialized this with a fair number of experts 1709 01:15:24,710 --> 01:15:27,080 in this field, and there's general consensus 1710 01:15:27,080 --> 01:15:29,752 this is a very good method for helping you get data out. 1711 01:15:29,752 --> 01:15:31,460 So that brings me to the end of the talk. 1712 01:15:31,460 --> 01:15:35,120 So as I said before, data wrangling is very important, 1713 01:15:35,120 --> 01:15:37,640 a significant part of the effort. 1714 01:15:37,640 --> 01:15:40,080 Analysis really relies on tabular data, 1715 01:15:40,080 --> 01:15:41,750 so that's a really good format. 1716 01:15:41,750 --> 01:15:45,262 You can do it in very simple, non-proprietary formats. 1717 01:15:45,262 --> 01:15:46,970 You're going to have to name your folders 1718 01:15:46,970 --> 01:15:49,310 and files something, you might as well 1719 01:15:49,310 --> 01:15:52,130 name them something that's really generally usable. 1720 01:15:52,130 --> 01:15:55,190 And then, finally, co-designing the sharing 1721 01:15:55,190 --> 01:15:58,840 and the intent at the same time, that's doable. 1722 01:15:58,840 --> 01:16:02,300 Generalized release of data is very, very hard. 1723 01:16:02,300 --> 01:16:04,340 Tailored release for specific applications 1724 01:16:04,340 --> 01:16:08,037 is very, very doable if you have good communication from all 1725 01:16:08,037 --> 01:16:09,120 the stakeholders involved. 1726 01:16:09,120 --> 01:16:11,210 So these are just really practical things. 1727 01:16:11,210 --> 01:16:13,412 This isn't, sort of, fancy AI. 1728 01:16:13,412 --> 01:16:14,870 This is, sort of, bread and butter, 1729 01:16:14,870 --> 01:16:17,390 how do you actually get data analysis done. 
1730 01:16:17,390 --> 01:16:21,010 And with that, I'm happy to take any questions.