Friday, October 03, 2014

Andreessen-Horowitz Again

Andreessen Horowitz had a second Academic Roundtable, gathering together academics, VCs, and people in startups to talk for a few days about a variety of topics.  I wasn't sure why I was invited the first time (which I wrote about last year), and am even less clear on why I was invited back, but I again had a very good time. 

Before a few words about the talks, the high order point:  I'm told that most/all of the talks will be put online over the coming weeks.  So if you're interested, you can experience them yourself.  I think it's great that they're doing that this year;  if you like the talks, you should tell them, so they keep doing it.  (If they don't have comments there, you can always comment here.)  Sadly, they don't seem to be up yet, but I'll post the links when they become available. 

Highlights would include a discussion of Bitcoin -- interesting to hear what Ed Felten, well-known Princeton professor and now ex-Chief Technologist of the FTC, thinks about the Bitcoin economy.  Dawn Song of Berkeley gave a general talk on security issues of the present and future, while Dan Boneh of Stanford gave a talk on the power of program obfuscation.  Raj Rajikumar of CMU gave some history and some peeks into the future of driverless cars -- it's not just Google, you know.  Tuomas Sandholm of CMU talked about his take on the starting of startups while still being an academic (based on now-multiple experiences), and Josh Bloom of UC Berkeley (and described the differences between writing papers about machine learning and building products using machine learning.

Of course, some heated discussion about the variety of issues that arise between academic work and transfer of technology to startups inevitably ensued.  (Maybe that's why I get invited.)  The key idea repeated by many (on both sides of the fence) in various forms was that businesses (and in particular startups) are very busy going down their road of development and product, and they may see many things out the sides of that road that are very interesting, but don't have time to explore off the road.  Academics are interested in things way off the road, often thinking of issues much further out in time-scale.  And (at least in my opinion) the role academics play is a good thing;  there (obviously) remains a lot of ways the two worlds can interact and cooperate. 

Thursday, September 18, 2014

On Academia vs. Industry.... MSR SVC Closing

I'm one of those professor types that ends up defending the joys of life as an academic versus a career in industry -- some of you who read this blog have probably seen me comment at Matt Welsh's blog or Daniel Lemire's blog/Google+ chain.  And to be clear I'm not anti-industry, I just want to make sure there's a fair discussion.

In that vein, I'm sad to hear that Microsoft Research Silicon Valley will be closing, as described in this article.  I know many people at MSRSV, and have visited there often.  It's a loss to the community to have it close.  I have sympathy for those who work there -- this must be stressful -- but this sympathy is happily diminished because I know the people there are so talented, they will quickly move on to other employment.  (When Digital Systems Research Center closed long ago, a number of the people I worked with in the lab just moved on to some small company that seemed to be just starting to really get going, called Google.  I wonder how it worked out for them.) 

After a moment of silence for the lab, I do feel it necessary to point out that this is one of the "issues" in the academia vs industry life choice.  The life cycle for companies moves rapidly.  That's not a bad thing, but it's a thing.  Disruptions like this are a non-trivial risk in industry, much less so in academia.  (Just look at the history of research labs.)  Again, I'm sure it will work out for the people at MSRSV, but any life shift like this -- even if it ends positively -- is stressful.  Without wanting to overstate the consequences, it's worth pointing out.

Monday, September 15, 2014

Simultaneous Enrollment

The Crimson has an op-ed on simultaneous enrollment, that I agree with.

Harvard does not like simultaneous enrollment, which means a student taking two classes that meet at the same time -- any time overlap counts (whether the whole class or half an hour once a week).  If you want to take a class via simultaneous enrollment, you have to petition the Administrative Board, and your professor is supposed to provide direct hour-per-hour instruction for the class you can't intend.  As a previous Crimson article states:
The Faculty Handbook requires that “direct and personal compensatory instruction” for simultaneous enrollment, but only recently has the Ad Board refused to recognize videotaped lectures as a stand-in for class time.
The article references that for the past several years the Ad Board has accepted recorded lectures, under some additional conditions, as a suitable proxy for the direct and personal compensatory instruction.  This apparently represented a change from their past position, and this last year, while I was on sabbatical, some Standing Committee on Education Policy decided to push back and say no more recorded substitutions. 

(An aside:  I don't know why they used the dated term "videotaped".  My lectures have been recorded and made available online;  these days, I'm not sure any "videotape" is actually involved in the process.)  

I believe I had a great deal to do with the Ad Board accepting recorded lectures the last several years.  My undergraduate class, CS 124, was recorded with very high quality production for the Harvard Extension School, and I made this video available to the students.  Some years ago, students starting approaching me wanting to do simultaneous enrollment for CS 124, and I thought it was fine, because of the availability of class lectures.  (That was, for instance, how the Extension students were taking the class.)  In particular, every year some number of students seemed to want to take CS 124 and a conflicting math class that overlapped, which made it very difficult for a small number of students who wanted to pursue both CS theory and mathematics.  (Both classes were "gateways" to more advanced classes;  at least for me, changing my class time would have just introduced other more challenging time conflicts;  neither class appeared poised to change times.)  But other class overlaps came up as well. 

I initially faced some opposition from the Ad Board when I supported the students' petitions for simultaneous enrollment because I suggested the recorded lectures were a suitable proxy.  They objected, and I objected to their objection.  After some back and forth, we found language which we managed to agree met the language of the faculty rules, with the outcome being the students could, in the end, rely on the recordings of class lectures.  I admit, I believe it helped that I had served on the Ad Board, and had a good working relationship with them, so they were willing to work with me to come to a mutually acceptable solution.  The framework we established was later used for CS 50 and other computer science classes, because I passed on my successful approach with the Ad Board to others.  (David Malan asked me for assistance on the issue, for example.)  There was, however, always some tension, with the Ad Board continuously questioning whether using class recordings to justify simultaneous enrollment was suitable.  Meanwhile, the number of such petitions kept growing. 

Apparently, while I was sabbaticalling, various powers that be have acted to tighten the rules for simultaneous enrollment, and disallow my previous framework.  Interestingly, CS 50 -- the class with the most requests for simultaneous enrollment -- seems to have acquired a blanket exception, which I am happy for, but still leaves the rest of us facing unhappy students who often have to deal with ugly situations.

What I find frustrating is that, to my knowledge, this change is not based on any data about student performance, or any understanding of student behavior.  It's based on a presumption that students are supposed to sit in classes to learn.  David Malan has some great data showing this presumption doesn't seem to have a basis in reality.  Students in CS50 taking the class under simultaneous enrollment don't do any worse than other students.  In fact, if I recall the data right, they do better than average;  this would not be surprising, as they are a self-selected group with apparently high interest in the subject.  David also has great data showing how lecture attendance steadily drops over the semester.  If a substantial fraction of the students are missing class and watching the video anyway, what's the point of blocking simultaneous enrollment?  Maybe because of all this data at David's command CS 50 got its exemption, but I'm sure if we go back we'd find similar findings for CS 124 and other classes.   

Professors have always had the right to refuse simultaneous enrollment petitions for their class, so if the professor expects student attendance, there is no problem.  I don't think it's a radical suggestion to say that for recorded classes, professors and students should have the ability to jointly decide if simultaneous enrollment is a suitable solution to the small number of inevitable class scheduling conflicts.  I'm not sure what combination of faculty and administrative busybodies decided they had to fix what I see as a non-problem, but I'm sure their rule-making (or in this case rule-interpreting), while it only affect a small number of students, will be a significant net negative.  Kudos to the Crimson for pointing out that this is a poor decision, and I only wish I was optimistic that it could be quickly reversed.  

Thursday, September 11, 2014


Preliminary course enrollment numbers are in.  Our new theory course, CS 125 (rapid introduction to theory), has 32 undergrads and 5 others, for an enrollment of 37.  Salil and I had predicted and desired in the 20-40 range for the first year of this course, so we're on target.  We hadn't thought about CS 125 being a class for grad students, but it actually makes sense.  Entering CS grad students who haven't had background in theory, or grad students from other related areas (statistics, mathematics, etc.) who want a fast-paced introduction to the basics of CS theory, would both fit in well for this class.  Maybe next year we'll even advertise it for grad students as well.

Our mainstream required-for-majors theory course, CS 121 (the "here's Turing machines and undecidability and NP-completeness" course), has an enrollment of just over 150, the largest it's ever been.  Given that CS 125 was supposed to draw some students from 121, that's even more amazing.  

And the standard CS introductory course, CS 50, has over 800 undergraduates (and over 850 total) signed up, making it now the largest class at Harvard.  This was so noteworthy that the Crimson had an article this morning about it.  As usual, you can find a great quote from Harry Lewis: 
“Harvard students are smart people,” said Harry R. Lewis ’68, former dean of the College and current director of undergraduate studies for Computer Science. “They have figured out that in pretty much every area of study, computational methods and computational thinking are going to be important to the future.”
So the growth continues, at least for another year.

Wednesday, September 10, 2014

You Can Rename My Family for.....

An interesting article in today's Crimson about the background to the announcement that Harvard's School of Public Health will be re-named the Harvard T. H. Chan School of Public Health, for a $350 million donation.  Feel free to comment on your feelings regarding the issue.  (The Kennedy School of Government notwithstanding, Harvard schools have not been "named" via donations historically, although certainly buildings and such have.)  I'm not publicly taking sides -- indeed, it seems like a wonderful debate challenge to figure out arguments on both sides of the issue as to whether naming schools in this way is a good or bad idea.  Even if you think it's a good idea, you might wonder about the price tag...

In particular, I can't help but wonder what the naming rights of the School of Engineering and Applied Sciences could go for?  In the course of our capital campaign, will I have the chance to find out?  Maybe we could get a bidding war going?  How much would SEAS have to get for me to feel good about having "Harvard's XXXX School of Engineering and Applied Sciences" on my letterhead for some appropriate name XXXX? 

At a baser level, Mitzenmacher has always been a problematic name.  Too long.  (On so many standardized forms, I end up as Michae Mitzenmache....)  Maybe I should offer myself up for naming rights.  Sadly, I don't think I'd get $350 million.  

Sunday, September 07, 2014

Teaching Sorting

For many years now, while teaching the introductory Algorithms and Data Structures, I have told students that we won't be starting with (or really covering) sorting and searching, because it's boring.  While that description is an exaggeration (the students see mergesort [an example of divide and conquer] and heapsort [heaps are important as an implementation of priority queues for Dijkstra's algorithm]), I've never thought that staring this class with a long unit on sorting/searching algorithms, which most textbooks like to start with, is the best use of time.

So it was a bit odd to find myself for our new "honors course", CS 125, spending the first day-and-a-half or so on sorting algorithms.  The key was that there was the right motivation for it.  We'll be tackling both algorithms/data structures and complexity, and the theme of how to model computation will play a big role throughout the course.  When we thought about what was the right way to introduce the class, and in particular this theme, sorting seemed like the right choice.

After very quickly refreshing the students on some basic comparison sorts (bubblesort and mergesort), we present the standard lower bound argument for comparison-based sorts.  But then we show how this lower bound can be broken, by assumptions about the data that allow one to go beyond comparisons to other operations.  Specifically, we raced through counting sort, bucket sort (for random data), and sorting via van Emde Boas trees.    (The students in particular seemed to be talking through the details of van Emde Boas trees at the class break, which I'd like to think because they were new and interesting, but possibly was due to the fact that it was the first time I'd ever presented them, and maybe my delivery for that topic needs some tuning.) 

Lo and behold, I find myself finding sorting interesting!  We didn't even get to more advanced interesting possibilities, like data-oblivious sorting or "bleeding edge" parallel sorting.  I feel I need to apologize to sorting, for mistakenly suggesting that it's a boring topic.  I still wouldn't want to start an algorithms/data structures class with multiple weeks of sorting and searching algorithms, but sorting is a natural way to introduce some key concepts in algorithms, and there's fun to be had in the large variety of approaches to sorting.

Thursday, August 28, 2014

Shout-out to Sabeti Lab

A shout-out today to my friend and colleague Pardis Sabeti (and her lab) for their Science article on the Ebola virus that appeared earlier today.  Pardis and her group have been studying the genetics of viral diseases, and in particular the Lassa virus.  So they were there and ready when the recent Ebola virus began and went to work.  They sequenced 99 Ebola virus genomes from 78 patients, and have analyzed the resulting information to gain insight into how the disease is spreading and mutating. They have released the genetic sequences to the public so that the information is available to groups who could use the sequencing to help find and test cures.  This is both very important work, and (sadly) very serious.  As reported, 5 co-authors of the study (health care workers, a lab technician, and the director of the national Lassa fever program in Sierra Leone) have died of Ebola before today's publication.  

Numerous news articles appeared today discussing this work, too many for me to link to.  But the SabetiLab page here contains a great deal of information about her research and various projects.  Her work in biology, which makes powerful use of statistics and computation, should serve as an inspiration for future scientists, and for people who want to use algorithmic and mathematical methods in ways to benefit the world. 

Wednesday, August 27, 2014

Update on SOCG and ACM

I am happy to have an update on the SOCG/STOC colocation issue that arose last week.  Or, better said, Jeff Erickson has an update, the short summary of which is that it looks like there has now been some very useful clarification.  The concerns of the ACM apparently are limited to direct financial support, and the conferences can (formally) co-locate.  I encourage you to read the note on Jeff's blog from Paul, and let me echo Jeff's statement here: 

"Needless to say, this is fantastic news!  I want to publicly thank Paul and the ACM leadership for quickly clearing up this unfortunate misunderstanding."

So say we all.

Tuesday, August 26, 2014

New Article on arxiv on Equitability and MIC

We recently put on arxiv a new draft on "Theoretical Foundations of Equitability and the Maximal Information Coefficient".  This is some follow-on work to a paper that appeared in Science a couple of years ago, where we introduced the idea of equitability.  Essentially, in that Science paper (link to page where you can access the paper), we wanted a statistic that would give back, for samples from a noisy functional relationship, a score corresponding to the amount of noise (or, in that case, to the R^2 of the noisy data relative to the relevant noiseless function), regardless of the relationship type.  The idea was that this would be useful in data exploration settings, where we might have a large number of possible relationship pairs and in particular a number of non-trivially correlated relationships, and we'd want to score them, in some fair way across the possible types of relationships (linear, parabolic, sinusoidal, etc.), so that we could choose the most promising to look at.  We also wanted the statistic to do reasonable things for non-functional relationships.  And, finally, we wanted a pony.  (But we couldn't find a way to put that in the paper.)  The maximal information coefficient (MIC), which we built on top of mutual information, was our proposed statistic.

The paper has gotten some interest.  One thing that we heard was that people wanted a richer theoretical framework for these ideas.  So now we're finally delivering one.  It took a while, because the students involved -- Yakir Reshef and David Reshef -- were off doing crazy, wacky young-people things like going to medical school, making it hard to get cycles for the project.  On the other hand, the time did some good, allowing us to explore to determine the formulation we wanted. The result is, I hope, an interesting mix of ideas from statistics and computer science.  We're eager for feedback as we hope to formally submit somewhere soon. 

In a couple of weeks we should have another paper out on the same topic that is more empirical.  Naturally, when working through the theory, we came up with better algorithms for computing MIC, and it made sense to separate those results (and some others) into another paper.