I've always been impressed how the SIGCOMM conference accepts a very small number of papers, but many hundreds of people attend the conference. (This is true for some other conferences as well; SIGCOMM is just he one I happen to know best.) The conference is somehow an "attend-if-possible" venue for the networking community. Most other large conferences I know work instead by accepting a lot of papers. ISIT is a week-long event with 9 parallel sessions (in 2014). Location may certainly be helpful; ISIT was in Honolulu in 2014 and
Hong Kong in 2015; SIGCOMM will be in London this year. But it also isn't decisive; the Allerton conference has been steadily growing, so it's a few hundred people, even though Urbana-Champaign, where the conference is held, really can't be the biggest draw. (No offense to those who are there -- I always enjoy visiting!) In fact, really, at this point, Allerton has been so successful it has seemingly outgrown the Allerton conference center!

If you have ideas for what makes a conference a "must-attend" venue, I'd be interested in hearing them. The question has arisen because of a Windows on Theory post about changing the format for STOC 2017; for those of you on the theory side who haven't seen it, I'd recommend looking at the post and registering an opinion by comment or by e-mail to the appropriate parties, after you've considered the issue. The underlying question is what role should the STOC/FOCS conferences play in the world these days, and how might we change things to get them to play that role. I think another way of framing this is that I don't think STOC/FOCS aren't really "attend-if-possible" venues for much of the theory community -- particularly for certain subareas -- and the question is whether this can change. Again, I'd be happy for insights from other communities as well.

There are a large number of comments already. I'd say in terms of my personal opinion on what to do I'm most aligned with Alex Lopez-Ortiz and Eric Vigoda. Alex's point is straightforward -- accept more papers. I've been in the accept more papers camp for a while (a very old post on the topic suggests one way it could be done), but for some reason there seems to be huge numbers of people in the theory community against the idea, based on some sort of quality argument. For me, the tradeoffs are pretty simple; I generally prefer not to travel, so I need a good reason to go to a conference, and one very good reason is that I have a paper in the conference that I or a colleague is presenting. (I recognize this is not true for everyone, but being selective in travel is an issue that arises for people with families, or other regular obligations.) Clearly conferences such as SIGCOMM show that there are other paths possible, but I'm not clear on how to reproduce that kind of buy-in from the theory community. Eric's suggestion was that lots of theory conferences happen during the summer, so instead of trying other ways to create a larger "STOC festival", why not co-locate a number of conferences (that would have parallel sessions) that would get more of the theoretical computer science community together. That makes sense to me, and seems like it could be particularly beneficial for some of the smaller conferences.

But my personal opinion, while undoubtedly fascinating to some (most notably myself), is really not the point. The point is for the community to examine the current situation, think if there are ways to make it better, and then act on what they come up with. The bet way for this to happen is if people actively participate in discussions. So please, go opine! There's no reason the post shouldn't have 100+ comments, so comment away.

## Monday, May 18, 2015

## Friday, May 15, 2015

### New Work on Equitability and MIC

A few years ago, working with a diverse group of scientists, I was looking at a problem related to big data analysis. The setting was data exploration, where you are working with a high-dimensional dataset, and you are seeking pairs of dimensions that are related in interesting but a priori unknown ways; that is, maybe there's a linear relationship, but maybe there's a more complicated relationship (sinusoidal, parabolic, etc.), and you don't know what you're looking for in advance.

One statistical approach is to throw a so-called measure of dependence at the problem, which will tell you whether variables are related or unrelated. But in big data sets, you may expect to have lots of dependent pairs; what you really want is not just a "dependent-independent" test, but a scoring mechanism that ranks the extent of dependence appropriately over a wide range of possible relationships. This methodology seemed underserved by the literature. We dubbed what we were aiming for "equitability", and designed a statistic we dubbed MIC (maximal information coefficient), a "bucketed" version of mutual information, that seemed to be a good ranking mechanism for equitability. The work appeared in Science some years back (the link here will allow you to access it).

The work has, I think, been quite successful -- a number of people have used MIC in their research to analyze data sets, and the idea of equitability seems to be catching on. But after it came out, there were some questions and issues raised primarily by statisticians. We had always planned to go back and revisit some of these issues, and, after various delays caused by life, a group of us started back on it again. The project seemed to balloon, and we've been working on and off on it for at least a year now. (Some earlier, initial drafts pointing to where we were going are on the arxiv.) Finally, we've reached a stopping point, and we've just put up 3 papers on the arxiv, all of which are now being submitted to various journals. In this post, I'll outline what the papers cover, after the links.

First, people wanted further theoretical foundations for equitability. Our Science paper, being for a general audience, provided an intuitive definition, but we didn't define it as one would in a mathematics paper. (That was, I should be clear, intentional.) We thought our definition was pretty clear in plain English, but for some it was not sufficient; one group, in particular, seemed to ignore our explanations when in their own work they added restrictions to "equitability" and incorrectly ascribed them to us.

So the first paper formalizes this notion of equitability. As is usually the case, formalizing it turns out to be helpful. In particular, it shows that there are (at least) two natural ways of thinking about equitability. One way formalizes our original conception that an equitable statistic is a statistic that gives similar scores to relationships with the same amount of noise, even if the relationships are of different types. But a different way of viewing the formalization shows that equitability can naturally be seen as an extension of statistical power against a null hypothesis of independence (i.e., relationship strength = 0), to power against null hypotheses representing all levels of dependence (i.e. relationship strength <=x for any x). This view clarifies the relationship between equitability and power against independence, showing the former generalizes the latter. In our new papers we have some really nice "heat-map" style visualizations for viewing equitability performance results based on this view that I think are really useful in thinking about and understanding equitability in the paper.

Second, people wanted faster, better versions of our proposed scoring function, MIC. We ended up going back to the theory of MIC, examining further its relation to mutual information, and, perhaps unsurprisingly, doing so allowed us to make tangible practical advances. We ended up with algorithms for computing MIC that are both faster and more accurate. (Our original algorithm in the Science paper used a heuristic approach based on dynamic programming to approximate the MIC score from the data; our new approaches have both improved speed and accuracy.) We're hopeful that people using MIC will be able to switch over time to these improved algorithms. Connecting to the first paper, in re-examining MIC we also developed a variation of MIC (called TIC, for total information coefficient) that is better designed for achieving power against independence as opposed to equitability. We are hopeful that TIC may prove useful to people on its own, as well as in conjunction with MIC.

Third, people just wanted to see more comparisons. Well, really, here I think everyone just has their own favorite measure of dependence, and before they go switching to this new-fangled thing that appeared in Science of all places, they want to see more. Specifically, subsequent to that initial paper, there were people concerned about the power against independence of our methods -- although, again, to be clear, equitability is a larger notion that power against independence, so one might expect a method designed for equitability would not have as high power against independence as methods designed specifically for that purpose. Also, there were some people who claimed, based on very limited experiments, that mutual information estimation would be more equitable than MIC.

So the third paper is a large-scale empirical study. In terms of equitability, we find that MIC still seems to be the current best measure. (Mutual information sometimes does well -- in some very particular circumstances, it can do better than MIC, which is not especially surprising since MIC is a "scaled" version of mutual information -- but MIC still performs much better overall.) In terms of power against independence, we find that MICs power was underrated in previous studies. The discrepancy seemed to be that previous studies used MICs default parameter settings, which were designed for equitability performance, not power against independence; using different default parameters yields substantially better power against independence. (The new algorithms in the second paper that yield more accurate calculations improve the power further.) Finally, the TIC measure described in the second paper does even better, is easily computed when computing MIC scores (and so has negligible overhead if computing MIC scores), and appears to be comparable to other state-of-the-art measures in terms of power against independence. We also find that the new ways of computing MIC are indeed faster and more accurate than previous methods.

We're hoping these works, collectively, move the ball forward on this topic. We like seeing MIC being used, and hope people will start using it more when analyzing their large dimensional data sets. To be clear, perhaps tomorrow we will find that there are better scoring mechanisms than MIC for equitability; perhaps even better mutual information estimators will suffice. That would be great! (I think of it like clustering algorithms; there are lots of good clustering algorithms, different ones may be better suited to different situations. The more the merrier.) We also think equitability continues to be a useful framework, and that there's more to be done with equitability. Though for now our group may again take a breather on this topic, and see what arises from our current work.

One statistical approach is to throw a so-called measure of dependence at the problem, which will tell you whether variables are related or unrelated. But in big data sets, you may expect to have lots of dependent pairs; what you really want is not just a "dependent-independent" test, but a scoring mechanism that ranks the extent of dependence appropriately over a wide range of possible relationships. This methodology seemed underserved by the literature. We dubbed what we were aiming for "equitability", and designed a statistic we dubbed MIC (maximal information coefficient), a "bucketed" version of mutual information, that seemed to be a good ranking mechanism for equitability. The work appeared in Science some years back (the link here will allow you to access it).

The work has, I think, been quite successful -- a number of people have used MIC in their research to analyze data sets, and the idea of equitability seems to be catching on. But after it came out, there were some questions and issues raised primarily by statisticians. We had always planned to go back and revisit some of these issues, and, after various delays caused by life, a group of us started back on it again. The project seemed to balloon, and we've been working on and off on it for at least a year now. (Some earlier, initial drafts pointing to where we were going are on the arxiv.) Finally, we've reached a stopping point, and we've just put up 3 papers on the arxiv, all of which are now being submitted to various journals. In this post, I'll outline what the papers cover, after the links.

- Equitability, interval estimation, and statistical power. arXiv

- Measuring dependence powerfully and equitably.
arXiv

- An Empirical Study of Leading Measures of Dependence. arXiv

First, people wanted further theoretical foundations for equitability. Our Science paper, being for a general audience, provided an intuitive definition, but we didn't define it as one would in a mathematics paper. (That was, I should be clear, intentional.) We thought our definition was pretty clear in plain English, but for some it was not sufficient; one group, in particular, seemed to ignore our explanations when in their own work they added restrictions to "equitability" and incorrectly ascribed them to us.

So the first paper formalizes this notion of equitability. As is usually the case, formalizing it turns out to be helpful. In particular, it shows that there are (at least) two natural ways of thinking about equitability. One way formalizes our original conception that an equitable statistic is a statistic that gives similar scores to relationships with the same amount of noise, even if the relationships are of different types. But a different way of viewing the formalization shows that equitability can naturally be seen as an extension of statistical power against a null hypothesis of independence (i.e., relationship strength = 0), to power against null hypotheses representing all levels of dependence (i.e. relationship strength <=x for any x). This view clarifies the relationship between equitability and power against independence, showing the former generalizes the latter. In our new papers we have some really nice "heat-map" style visualizations for viewing equitability performance results based on this view that I think are really useful in thinking about and understanding equitability in the paper.

Second, people wanted faster, better versions of our proposed scoring function, MIC. We ended up going back to the theory of MIC, examining further its relation to mutual information, and, perhaps unsurprisingly, doing so allowed us to make tangible practical advances. We ended up with algorithms for computing MIC that are both faster and more accurate. (Our original algorithm in the Science paper used a heuristic approach based on dynamic programming to approximate the MIC score from the data; our new approaches have both improved speed and accuracy.) We're hopeful that people using MIC will be able to switch over time to these improved algorithms. Connecting to the first paper, in re-examining MIC we also developed a variation of MIC (called TIC, for total information coefficient) that is better designed for achieving power against independence as opposed to equitability. We are hopeful that TIC may prove useful to people on its own, as well as in conjunction with MIC.

Third, people just wanted to see more comparisons. Well, really, here I think everyone just has their own favorite measure of dependence, and before they go switching to this new-fangled thing that appeared in Science of all places, they want to see more. Specifically, subsequent to that initial paper, there were people concerned about the power against independence of our methods -- although, again, to be clear, equitability is a larger notion that power against independence, so one might expect a method designed for equitability would not have as high power against independence as methods designed specifically for that purpose. Also, there were some people who claimed, based on very limited experiments, that mutual information estimation would be more equitable than MIC.

So the third paper is a large-scale empirical study. In terms of equitability, we find that MIC still seems to be the current best measure. (Mutual information sometimes does well -- in some very particular circumstances, it can do better than MIC, which is not especially surprising since MIC is a "scaled" version of mutual information -- but MIC still performs much better overall.) In terms of power against independence, we find that MICs power was underrated in previous studies. The discrepancy seemed to be that previous studies used MICs default parameter settings, which were designed for equitability performance, not power against independence; using different default parameters yields substantially better power against independence. (The new algorithms in the second paper that yield more accurate calculations improve the power further.) Finally, the TIC measure described in the second paper does even better, is easily computed when computing MIC scores (and so has negligible overhead if computing MIC scores), and appears to be comparable to other state-of-the-art measures in terms of power against independence. We also find that the new ways of computing MIC are indeed faster and more accurate than previous methods.

We're hoping these works, collectively, move the ball forward on this topic. We like seeing MIC being used, and hope people will start using it more when analyzing their large dimensional data sets. To be clear, perhaps tomorrow we will find that there are better scoring mechanisms than MIC for equitability; perhaps even better mutual information estimators will suffice. That would be great! (I think of it like clustering algorithms; there are lots of good clustering algorithms, different ones may be better suited to different situations. The more the merrier.) We also think equitability continues to be a useful framework, and that there's more to be done with equitability. Though for now our group may again take a breather on this topic, and see what arises from our current work.

## Saturday, April 25, 2015

### Girls Who Code (Newton) Visit to Harvard

My friend David Miller is looking for instructors to help out with the Newton Girls who Code club. Here's an announcement, please connect with him if you're interested.

They visited Harvard last week -- David gave me permission to post his description of the visit. It seemed like a well-organized event -- thanks to the Harvard Women in Computer Science group for putting it together.

----

Last Friday, the Newton Girls Who Code club was welcomed by the Harvard Women in Computer Science at Harvard's Maxwell-Dworkin Laboratory. The students learned about the joint Harvard-MIT Robot Soccer team from the mechanics tech lead, Harvard junior Kate Donahue. She showed us last years fleet of robots (named after Greek and Roman gods), and described their work on preparing this years fleet (to be named after Harry Potter characters). Kate emphasized the interplay between computer vision, artificial intelligence, mechanical engineering, and distributed systems. Many of the robot parts are 3D printed -- a technology that the Newton GWC students will become more familiar with this fall as we integrate the Newton Library's 3D printer into the GWC activities.

After the robots demonstration, the students took part in a Q+A discussion with Harvard WiCS undergrads Hana Kim, Ann Hwang, and Adesola Sanusi. Our students asked great questions about our hosts' motivation and history with coding, the mechanics of being a college CS major, the role of gender in the CS community, the connections between computer science and other fields, and our hosts' vision for the future of computing. The WiCS undergrads are excellent role models and were enormously encouraging. They pronounced our students more than ready to take Harvard's most popular course, Introduction to Computer Science, and recommended they try out the free, online, EdX version today. It was an exhilarating afternoon!

They visited Harvard last week -- David gave me permission to post his description of the visit. It seemed like a well-organized event -- thanks to the Harvard Women in Computer Science group for putting it together.

----

Last Friday, the Newton Girls Who Code club was welcomed by the Harvard Women in Computer Science at Harvard's Maxwell-Dworkin Laboratory. The students learned about the joint Harvard-MIT Robot Soccer team from the mechanics tech lead, Harvard junior Kate Donahue. She showed us last years fleet of robots (named after Greek and Roman gods), and described their work on preparing this years fleet (to be named after Harry Potter characters). Kate emphasized the interplay between computer vision, artificial intelligence, mechanical engineering, and distributed systems. Many of the robot parts are 3D printed -- a technology that the Newton GWC students will become more familiar with this fall as we integrate the Newton Library's 3D printer into the GWC activities.

After the robots demonstration, the students took part in a Q+A discussion with Harvard WiCS undergrads Hana Kim, Ann Hwang, and Adesola Sanusi. Our students asked great questions about our hosts' motivation and history with coding, the mechanics of being a college CS major, the role of gender in the CS community, the connections between computer science and other fields, and our hosts' vision for the future of computing. The WiCS undergrads are excellent role models and were enormously encouraging. They pronounced our students more than ready to take Harvard's most popular course, Introduction to Computer Science, and recommended they try out the free, online, EdX version today. It was an exhilarating afternoon!

## Monday, April 20, 2015

### My (Positive) Apple Retail Repair Experience

I switched to Apple machines several years ago, and have been very happy with them generally.

One convenience is there is an Apple store in the mall about 10 minutes from my home, so when things go wrong, I know where to go. Thursday night a screen-related hardware glitch developed (which, to be clear, in this case was not what I would call a product defect; let's say it was a me defect). I ran over and they said it would probably need to be taken in for a repair.

Not being ready to hand over machine just yet, I went home to pull off some data, and then booked an appointment. The downside to the Apple store is you really need to book an appointment online, even for (or especially for) things involving repair. My walk-in Thursday I didn't even get up to the Genius Bar; one of the people up front just suggested what they thought was going on. And my Apple store, at least, is painfully busy, all the time. I got an appointment for Saturday afternoon, nothing available Friday. I went in Friday morning before heading to work on the off chance I could get it in then; they said the walk-in waiting time was 90 minutes to 2 hours. I should have expected; the walk-in waiting time always seems to be measured in hours.

That's the downside. The positive is, when you have an appointment, they are great. I had no wait when I went in Saturday, and in the past, if they're running behind, they've let me know if I have to wait for my appointment and estimated for how long. They are courteous and professional. They will let you know what they are doing as they test your machine, what needs to be done, and provide estimates for how long it will take. Their setup really engenders trust. I also have found their repair times reasonable and understandable. My machine is back to me and all is well Monday afternoon. (Thank you, Apple Store.)

I haven't had to deal with them often over the last decade -- I'd say a reasonable number of times considering the number of Apple machines I've had and in family use. Overall this experience is typical. There's anxiety in going in for any sort of repair process, but they did an excellent job.

Finally, I imagine this is not useful advice to people reading this blog, but do back up your data regularly and/or automatically. One of the first questions they always ask is whether you have a backup, and it's always a good feeling to know that if worst comes to worst you just put your state on a new machine. Someone else at the Genius Bar while I was there had their computer put back in working order but their data was lost; they weren't the sort of user that it was a real problem, but the occasional reminder that backups are important is always helpful.

One convenience is there is an Apple store in the mall about 10 minutes from my home, so when things go wrong, I know where to go. Thursday night a screen-related hardware glitch developed (which, to be clear, in this case was not what I would call a product defect; let's say it was a me defect). I ran over and they said it would probably need to be taken in for a repair.

Not being ready to hand over machine just yet, I went home to pull off some data, and then booked an appointment. The downside to the Apple store is you really need to book an appointment online, even for (or especially for) things involving repair. My walk-in Thursday I didn't even get up to the Genius Bar; one of the people up front just suggested what they thought was going on. And my Apple store, at least, is painfully busy, all the time. I got an appointment for Saturday afternoon, nothing available Friday. I went in Friday morning before heading to work on the off chance I could get it in then; they said the walk-in waiting time was 90 minutes to 2 hours. I should have expected; the walk-in waiting time always seems to be measured in hours.

That's the downside. The positive is, when you have an appointment, they are great. I had no wait when I went in Saturday, and in the past, if they're running behind, they've let me know if I have to wait for my appointment and estimated for how long. They are courteous and professional. They will let you know what they are doing as they test your machine, what needs to be done, and provide estimates for how long it will take. Their setup really engenders trust. I also have found their repair times reasonable and understandable. My machine is back to me and all is well Monday afternoon. (Thank you, Apple Store.)

I haven't had to deal with them often over the last decade -- I'd say a reasonable number of times considering the number of Apple machines I've had and in family use. Overall this experience is typical. There's anxiety in going in for any sort of repair process, but they did an excellent job.

Finally, I imagine this is not useful advice to people reading this blog, but do back up your data regularly and/or automatically. One of the first questions they always ask is whether you have a backup, and it's always a good feeling to know that if worst comes to worst you just put your state on a new machine. Someone else at the Genius Bar while I was there had their computer put back in working order but their data was lost; they weren't the sort of user that it was a real problem, but the occasional reminder that backups are important is always helpful.

## Friday, April 03, 2015

### Women in Math (and Computer Science)

Wednesday afternoon I went to a panel organized by the Harvard Undergraduate Mathematics Association for a "Gender Gap on Math Discussion". I went both to hear what was going on (as there's still that gender gap in CS, and we're always eager to hear ways that we might do things better that might reduce that), and for moral support for some people I know who were involved. The panel was co-organized by Meena Boppana, an undergraduate who did research with me the summer before last and is currently a star in my graduate class, and one of the panelists was Hilary Finucane, a graduate student who I advised on her senior thesis and collaborated with on multiple papers with when she was an undergraduate at Harvard. I should note that Meena and several others had done a survey of Harvard math undergraduates which had highlighted some issues that would be a starting point for discussion.

I could only stay for the first half or so, but it seemed very positive. A number of faculty showed up, which was promising. My take on the panel's take was was that they were interested in how to make improvements in the culture, and the goal was to try to start figuring out how that could happen, in part by sharing their experiences. The discussion was both balanced and thoughtful, presented positives with negatives, but focused on how to improve things. There's a writeup in the Crimson with more details. The main point that came out in the first half was something I've seen and heard before: the importance of having a community, including (but not necessarily limited to) a community of women that can offer support, guidance, and mentorship, but also just so you don't continually feel like the only woman in the room.

And as long as we're on the subject, there's been a number of recent stories (or older stories where I've recently seen the links) on women in math and computer science. Focusing on Harvard to start, there's a nice writeup about Harvard's Women in Computer Science group, which has helped provide that community that encourages women to take classes in and concentrate in computer science. An article from last year discusses progress at Harvard in closing the gender gap in computer science. There was even an article in the Harvard Political Review covering gender gap issues at Harvard.

Outside of Harvard, from sources on Google+ I've seen a blog post with an interesting slide deck from one Katie Cunningham that provides a great starting point of discussion about the culture and women in computer science. And, finally, a link to something simultaneously sad and funny (things-male-tech-colleagues-have-actually-said-annotated) that reminds us why we have to keep trying to improve the culture.

I could only stay for the first half or so, but it seemed very positive. A number of faculty showed up, which was promising. My take on the panel's take was was that they were interested in how to make improvements in the culture, and the goal was to try to start figuring out how that could happen, in part by sharing their experiences. The discussion was both balanced and thoughtful, presented positives with negatives, but focused on how to improve things. There's a writeup in the Crimson with more details. The main point that came out in the first half was something I've seen and heard before: the importance of having a community, including (but not necessarily limited to) a community of women that can offer support, guidance, and mentorship, but also just so you don't continually feel like the only woman in the room.

And as long as we're on the subject, there's been a number of recent stories (or older stories where I've recently seen the links) on women in math and computer science. Focusing on Harvard to start, there's a nice writeup about Harvard's Women in Computer Science group, which has helped provide that community that encourages women to take classes in and concentrate in computer science. An article from last year discusses progress at Harvard in closing the gender gap in computer science. There was even an article in the Harvard Political Review covering gender gap issues at Harvard.

Outside of Harvard, from sources on Google+ I've seen a blog post with an interesting slide deck from one Katie Cunningham that provides a great starting point of discussion about the culture and women in computer science. And, finally, a link to something simultaneously sad and funny (things-male-tech-colleagues-have-actually-said-annotated) that reminds us why we have to keep trying to improve the culture.

## Thursday, April 02, 2015

### On the Shannon Centennial

I found in my snail mail mailbox my paper copy of the IEEE Information Theory Society Newsletter. First, I was delighted by the news that Michelle Effros (of Cal Tech) is the new President of the IEEE Information Theory Society. Michelle has a long history of service (as well as, it goes without saying, outstanding research) in the information theory community, and is a great selection for the job.

In her opening column, Michelle discusses the importance of letting people outside of their community know what the information theory research community is doing, especially with the Shannon Centennial (April 30, 2016 will be the 100th anniversary of his birth) coming up. The IT Society will be spearheading outreach efforts as part of the Centennial. As Michelle says,

I have always thought that the Information Theory community and the computer science community -- particularly on the theory side -- should interact and communicate more, as there are huge overlaps in the problems being studied and still significant differences in techniques used (although there's more and more crossover in this regard). Perhaps the Shannon Centennial will provide some grand opportunities for the two communities to come together, to promote the Shannon legacy, and as a side benefit to learn more from and about each other.

In her opening column, Michelle discusses the importance of letting people outside of their community know what the information theory research community is doing, especially with the Shannon Centennial (April 30, 2016 will be the 100th anniversary of his birth) coming up. The IT Society will be spearheading outreach efforts as part of the Centennial. As Michelle says,

Every school child learns the name of Albert Einstein; his most famous equation has somehow entered the realm of popular culture. Why is it that so few people know the name or have heard about the contributions of Claude Elwood Shannon?In Computer Science, Turing is our "guiding light", and we had a very successful centenary celebration -- as well as a recent popular movie The Imitation Game -- to make Turing's life and work as well as the importance of computer science as a scientific discipline more well known and understood to the rest of the world. But Shannon, too, is one of the guiding lights of computer science; it is hard to imagine large parts of computer science theory and networking, for example, without the foundations laid out by Shannon in developing the theory of communications.

I have always thought that the Information Theory community and the computer science community -- particularly on the theory side -- should interact and communicate more, as there are huge overlaps in the problems being studied and still significant differences in techniques used (although there's more and more crossover in this regard). Perhaps the Shannon Centennial will provide some grand opportunities for the two communities to come together, to promote the Shannon legacy, and as a side benefit to learn more from and about each other.

## Saturday, March 28, 2015

### Links: HBR article on Women in STEM and AAUW Report

A latest article on bias issues for women in STEM by Harvard Business Review.

A slide presentation by AAUW related to their report: Solving the Equation: The Variables for Women’s Success in Engineering and Computing. The full report can be downloaded for free here.

A slide presentation by AAUW related to their report: Solving the Equation: The Variables for Women’s Success in Engineering and Computing. The full report can be downloaded for free here.

## Wednesday, March 18, 2015

### Double Hashing (Lueker and Molodowitch)

A subject I've grown interested in, related to multiple-choice hashing schemes, is when (and why) double hashing can be used in place of "random hashing" with an asymptotically negligible difference in performance.

One early, useful work on this subject is by Lueker and Molodowitch. They provide a very nice coupling argument between double hashing and random hashing in the setting of open address hashing in their paper More Analysis of Double Hashing. (The original paper appeared in STOC 1988.) In this post I'll summarize their argument. I apologize that both the text and my exposition might be a little rough.

They work in the open address hashing setting; each key runs through a permutation of the table locations when it is being placed, and it placed in the first empty location, with each location holding a single key. When searching for a key, we run sequentially through its permutation; we either eventually find the element or we find an empty slot, in which case we know the key was not in the table, and the search was unsuccessful. We measure the expected time for an unsuccessful search when a table with m slots is loaded with pm keys for a constant fraction p. For convenience we will have m be prime, as this will simplify matters when we consider double hashing. If each key's permutation is completely uniform over all permutations, we call this random hashing, and the expected time to search for key not in the table is 1/(1-p) + o(1); with some work you can get that it is 1/(1-p)+O(1/m), but we will not concern ourselves so much with the low order terms here.

With double hashing, for a key x, the permutation is given by h_1(x)+ j h_2(x) mod m for hash functions h_1 and h_2, where h_1(x) is uniform over the range [0,m-1], h_2(x) is uniform over the range [1,m-1], and the permutation takes the values in the order j=0,1,2,... This gives a permutation (because m is prime), and with double hashing, you just need two random hash values, which from a theoretical standpoint is "much less randomness" than a fully random permutation, and from a practical standpoint is easier to implement.

What Lueker and Molodowitch show is that for any (constant) load factor p, with double hashing, the expected time for an unsuccessful search remains 1/(1-p) + o(1). They show this through a coupling, which shows that double hashing and random hashing can be coupled so the "the same thing happens" -- that is, the key goes into the same slot -- under both double hashing and random hashing most of the time. Unfortunately, it doesn't happen all the time; the coupling is not strong enough to say that all the keys are placed the same with high probability. But they show that they can arrange the coupling so that thing work out nicely just the same.

To start, let us start with a setting where we have loaded our tables with n keys using random hashing, and now take two copies of our state, and consider a single step of random hashing in one copy and double hashing in the other copy, side by side. Clearly, for random hashing, the probability that a key is placed in any empty slot is 1/(m-n) for each slot. In expectation (over the random past), by symmetry, for double hashing the expected probability that a key is placed in any empty is 1/(m-n), but the actual probability for each slot will depend on the configuration. But what they show, using Chernoff bounds, is the the actual probability the key is placed in each slot is at most q/(m-n) for some q that is (1+o(1)), with high probability over the past random placements of the n keys.

Now for the coupling. Starting from empty, at each step we use double hashing in both of our copies with probability 1/q = 1- o(1). Note that this ensures that the probability a key is placed in the "random hashing" copy of the process is at most 1/(m-n), so far. So with probability 1/q, we have placed the key in the same slot in both tables, and so it is as though we've done random hashing for this step.

But what about what happens with probability 1-1/q? Maybe we could ignore it, if 1/q was 1-o(1/n) for example, as a low probability event; unfortunately, that's not the case. In particular, we actually expect that the coupling will fail for some smallish (polylogarithmic) number of steps.

Instead, with probability 1-1/q we place the key so that the step follows random hashing in total. I'm not saying with probability 1-1/q we place the key at random; I'm saying we place the key so that, in total (including the 1-1/q probability first step where they key was placed by double hashing) we place the key so that, overall, the probability any empty slot obtains the key is 1/(m-n). Another way of thinking about this is in the other direction; my coupling always placed the key according the random hashing, and with probability 1/q (which again is very close to 1) that matches what would be done with double hashing.

So in our random hashing copy of the table, we just placed a key according to random hashing. How should we think of what is happening over in the double hashing copy? For that table, with probability 1/q all went fine -- a key was placed by double hashing -- and with probability 1-1/q some key just dropped into the table that wasn't placed by double hashing. It's like an extra present from above. But it's not a key placed by double hashing.

The next part of the argument is to recognize that that's OK, in the following sense. If you simply add a key anywhere is an open addressed hash table, you just make things worse, in a very specific way. Any slot in the table that would have been filled if you hadn't put in that key will still be filled at the end of the process even when you add that key. That is, if S is the set of slots that would contain a key if no extra keys get placed, and S' is the set of slots that contain a key if you, at various points in the process, just add some extra keys anywhere at any point, then a simple induction gives S is a subset S'.

So now let's consider multiple steps of this coupling. At each step, the ball is actually placed according to random hashing, so at every point in the process, the "state" is that of a random hashing process. On the double hashing side of the coupling, with probability 1/q a ball was placed by double hashing, and with probability 1-1/q an extra ball was just placed. So if we count the number of balls placed by double hashing, when we reach the time when n keys have been placed by double hashing in this process, on average n/(1-1/q) = n(1+o(1)) keys (in expectation -- by Chernoff bounds one can get a high probability result) have been placed overall.

The result: placing n keys by double hashing is stochastically dominated (in terms of the keys that have been placed) by placing n(1+o(1)) keys by random hashing. In particular, after we place n=pm keys using double hashing, the expected time for an unsuccessful search is bounded above the expected time for unsuccessful search after putting in pm+o(pm) keys using random hashing, which is 1/(1-p) + o(1). You can do a similar sort of coupling to show that double hashing stochastically dominates placing n(1-o(1)) keys by random hashing. As a result, asymptotically, there's only an o(1) difference in terms of the expected time for unsuccessful search, a result which explains the negligible difference in performance one sees in implementation.

One early, useful work on this subject is by Lueker and Molodowitch. They provide a very nice coupling argument between double hashing and random hashing in the setting of open address hashing in their paper More Analysis of Double Hashing. (The original paper appeared in STOC 1988.) In this post I'll summarize their argument. I apologize that both the text and my exposition might be a little rough.

They work in the open address hashing setting; each key runs through a permutation of the table locations when it is being placed, and it placed in the first empty location, with each location holding a single key. When searching for a key, we run sequentially through its permutation; we either eventually find the element or we find an empty slot, in which case we know the key was not in the table, and the search was unsuccessful. We measure the expected time for an unsuccessful search when a table with m slots is loaded with pm keys for a constant fraction p. For convenience we will have m be prime, as this will simplify matters when we consider double hashing. If each key's permutation is completely uniform over all permutations, we call this random hashing, and the expected time to search for key not in the table is 1/(1-p) + o(1); with some work you can get that it is 1/(1-p)+O(1/m), but we will not concern ourselves so much with the low order terms here.

With double hashing, for a key x, the permutation is given by h_1(x)+ j h_2(x) mod m for hash functions h_1 and h_2, where h_1(x) is uniform over the range [0,m-1], h_2(x) is uniform over the range [1,m-1], and the permutation takes the values in the order j=0,1,2,... This gives a permutation (because m is prime), and with double hashing, you just need two random hash values, which from a theoretical standpoint is "much less randomness" than a fully random permutation, and from a practical standpoint is easier to implement.

What Lueker and Molodowitch show is that for any (constant) load factor p, with double hashing, the expected time for an unsuccessful search remains 1/(1-p) + o(1). They show this through a coupling, which shows that double hashing and random hashing can be coupled so the "the same thing happens" -- that is, the key goes into the same slot -- under both double hashing and random hashing most of the time. Unfortunately, it doesn't happen all the time; the coupling is not strong enough to say that all the keys are placed the same with high probability. But they show that they can arrange the coupling so that thing work out nicely just the same.

To start, let us start with a setting where we have loaded our tables with n keys using random hashing, and now take two copies of our state, and consider a single step of random hashing in one copy and double hashing in the other copy, side by side. Clearly, for random hashing, the probability that a key is placed in any empty slot is 1/(m-n) for each slot. In expectation (over the random past), by symmetry, for double hashing the expected probability that a key is placed in any empty is 1/(m-n), but the actual probability for each slot will depend on the configuration. But what they show, using Chernoff bounds, is the the actual probability the key is placed in each slot is at most q/(m-n) for some q that is (1+o(1)), with high probability over the past random placements of the n keys.

Now for the coupling. Starting from empty, at each step we use double hashing in both of our copies with probability 1/q = 1- o(1). Note that this ensures that the probability a key is placed in the "random hashing" copy of the process is at most 1/(m-n), so far. So with probability 1/q, we have placed the key in the same slot in both tables, and so it is as though we've done random hashing for this step.

But what about what happens with probability 1-1/q? Maybe we could ignore it, if 1/q was 1-o(1/n) for example, as a low probability event; unfortunately, that's not the case. In particular, we actually expect that the coupling will fail for some smallish (polylogarithmic) number of steps.

Instead, with probability 1-1/q we place the key so that the step follows random hashing in total. I'm not saying with probability 1-1/q we place the key at random; I'm saying we place the key so that, in total (including the 1-1/q probability first step where they key was placed by double hashing) we place the key so that, overall, the probability any empty slot obtains the key is 1/(m-n). Another way of thinking about this is in the other direction; my coupling always placed the key according the random hashing, and with probability 1/q (which again is very close to 1) that matches what would be done with double hashing.

So in our random hashing copy of the table, we just placed a key according to random hashing. How should we think of what is happening over in the double hashing copy? For that table, with probability 1/q all went fine -- a key was placed by double hashing -- and with probability 1-1/q some key just dropped into the table that wasn't placed by double hashing. It's like an extra present from above. But it's not a key placed by double hashing.

The next part of the argument is to recognize that that's OK, in the following sense. If you simply add a key anywhere is an open addressed hash table, you just make things worse, in a very specific way. Any slot in the table that would have been filled if you hadn't put in that key will still be filled at the end of the process even when you add that key. That is, if S is the set of slots that would contain a key if no extra keys get placed, and S' is the set of slots that contain a key if you, at various points in the process, just add some extra keys anywhere at any point, then a simple induction gives S is a subset S'.

So now let's consider multiple steps of this coupling. At each step, the ball is actually placed according to random hashing, so at every point in the process, the "state" is that of a random hashing process. On the double hashing side of the coupling, with probability 1/q a ball was placed by double hashing, and with probability 1-1/q an extra ball was just placed. So if we count the number of balls placed by double hashing, when we reach the time when n keys have been placed by double hashing in this process, on average n/(1-1/q) = n(1+o(1)) keys (in expectation -- by Chernoff bounds one can get a high probability result) have been placed overall.

The result: placing n keys by double hashing is stochastically dominated (in terms of the keys that have been placed) by placing n(1+o(1)) keys by random hashing. In particular, after we place n=pm keys using double hashing, the expected time for an unsuccessful search is bounded above the expected time for unsuccessful search after putting in pm+o(pm) keys using random hashing, which is 1/(1-p) + o(1). You can do a similar sort of coupling to show that double hashing stochastically dominates placing n(1-o(1)) keys by random hashing. As a result, asymptotically, there's only an o(1) difference in terms of the expected time for unsuccessful search, a result which explains the negligible difference in performance one sees in implementation.

## Monday, March 16, 2015

### Power of Randomness at Georgia Tech

I'm spending (part of) the week at "The Power of Randomness in Computation Workshop", an IMA (Institute for Mathematics and its Applications) and ARC (Georgia Tech Algorithm and Randomness Center) co-sponsored workshop at Georgia Tech. Here's the schedule. I'm told slides will eventually be put up somewhere on the IMA website for such things. Great organization at Georgia Tech -- a big crowd in a very nice room, lots of food and coffee, all very well organized. They even had Ben Affleck waiting in front of the building for us this morning. He seemed to be a little busy shooting a movie to greet us properly, but maybe he'll have a bit more time to chat tomorrow.

Besides Ben, a few other highlights:

Leslie Goldberg started things of talking about the complexity of approximating complex-valued Ising and Tutte partition functions. I remember the Ising/Tutte models (mostly from graduate school and shortly after); now there are connections between various problems in quantum computing and these functions on complex values, which (or course) I had not known.

Nike Sun gave a talk on the exact k-SAT threshold (for large k). It was very clearly presented and gave the argument at the intuitive level. I gained some insight into why the "locally random tree" type argument I've enjoyed in coding/belief propagation arguments breaks down in certain satisfiability problems, due to clustering of solutions and other challenging correlations, and how those issues can be handled. I started to understand (I think) the point of replica symmetry breaking arguments and how they were used to guide the analysis of the k-SAT problem.

Other talks from the day: Amin-Coja Oghlan also talked about replica symmetry techniques and their uses for random graph coloring problems, Eli Upfal talked about some new shuffling techniques for oblivious storage dubbed the Melbourne shuffle, Aravind Srinivasan gave a talk on the Lovasz Local Lemma (staring from the Moser-Tardos results and showing how these arguments carry forward and give greater power and insight into the use of the LLL for additional problems), and I talked about invertible Bloom lookup tables and briefly mentioned a few other unrelated things in progress.

Besides Ben, a few other highlights:

Leslie Goldberg started things of talking about the complexity of approximating complex-valued Ising and Tutte partition functions. I remember the Ising/Tutte models (mostly from graduate school and shortly after); now there are connections between various problems in quantum computing and these functions on complex values, which (or course) I had not known.

Nike Sun gave a talk on the exact k-SAT threshold (for large k). It was very clearly presented and gave the argument at the intuitive level. I gained some insight into why the "locally random tree" type argument I've enjoyed in coding/belief propagation arguments breaks down in certain satisfiability problems, due to clustering of solutions and other challenging correlations, and how those issues can be handled. I started to understand (I think) the point of replica symmetry breaking arguments and how they were used to guide the analysis of the k-SAT problem.

Other talks from the day: Amin-Coja Oghlan also talked about replica symmetry techniques and their uses for random graph coloring problems, Eli Upfal talked about some new shuffling techniques for oblivious storage dubbed the Melbourne shuffle, Aravind Srinivasan gave a talk on the Lovasz Local Lemma (staring from the Moser-Tardos results and showing how these arguments carry forward and give greater power and insight into the use of the LLL for additional problems), and I talked about invertible Bloom lookup tables and briefly mentioned a few other unrelated things in progress.

Subscribe to:
Posts (Atom)