Monday, December 15, 2014

Stress: Competition and Ranking

One issue I keep seeing in comments here and elsewhere on this issue is that academia is very competitive, with everyone worried about their rank.  In my last post, I admitted that it would be hard to completely deny that there is competition, particularly when people are younger, which tends to come out when jobs are at stake.  But, for the most part, I think the role of competition is completely exaggerated, strangely so.  Academia -- at least, certainly, my branch of it -- thrives on collaboration, and I believe people who go into it thinking that it is a big competition are going to lose out, both professionally and in their enjoyment of the profession.  (Indeed, one of the reasons I write these posts is because I don't want undergraduate and graduate students getting what I think is a one-sided, incorrect view of how academics work.)

First, I would again like to compare academics with other professions.  I've seen comments (including on this blog) that other professions are less competitive than academia, and people are less worried about rank.  I think people who are making that suggestion need to check in with people working in those professions, because I think they're ridiculously wrong.  Lawyers coming out of school to get jobs, doctors trying to get fellowships or residencies, consultants at consulting firms -- all very competitive.  As you move up, this continues.  For lawyers, there's who can bill the most hours, and the challenge to make partner;  for doctors, who can get the best positions at hospitals;  and for businesspeople, every promotion is a step up.  Even for software engineering types, there's competition.  When I was a graduate student, I recall visiting friends who had gone to a large well-known company, and for a large part of the evening all they talked about was what "level" they and others were in the company and the annual reviews and who was doing what project that might get them ahead.  So let's not be delusional and start by acknowledging that there's competition everywhere, and that's unsurprising when jobs and money are at stake.  While I'm not suggesting I have deep knowledge of all of these careers, I think academics have much less competition than most.  

If academics appear like they're concerned about ranking, perhaps it's because they appear to be easy to rank.  First, as I pointed out last post, there's not that many of us.  Second, there are obvious metrics everyone can understand:  number of papers published, number of papers appearing in "top" conferences, and h-index stand out.  I'm not suggesting these are good metrics -- but they're easy and to a first order give (potentially) some useful information.  They're a quick way of bucketing or sorting people, particularly those without an established track record and are therefore not necessarily widely known or visible in the field, and therefore have more of an impression and an impact on younger academics.

But very quickly after your PhD, this sort of ranking loses its importance, and the very idea of ranking starts to lose its value -- as many have noted in a variety of venues.  In academia, there's lots of ways to be successful, many points on the Pareto frontier.  There are many great results waiting to be found and many different subfields to work in.  At the end of the day, a history of good work and specific achievements is what people look for;  there's not really a finite pool of such things for which to compete.  Indeed, I'm not sure how I would go about competing with the top people in the field, except to try to do interesting work, which is what I'm trying to do anyway.  (A secondary way to compete is just to make sure people know about your work.  But giving talks is less important to being successful than doing the work that goes into the talks in the first place;  again, it can have a bigger impact for people in the early stages of their career.) 

Against this idea of competition, just look at how people in academia work together.  In computer science theory, in particular, most papers have several authors working together.  In a number of cases these are students with their advisors, but a closer look reveals that in many cases, they are not.  Credit can be shared easily in academia, and collaborations can lead to results that individuals could not get alone.  Working in groups is a way for people to get more done.  Instead of competition, collaboration often yields the path to having a greater impact on the field.  Rather than being a "competitive game", research is more a "cooperative game".  (As an aside, this is why theory's approach of alphabetical order for authors rather than some sort of implicit "credit scheme" based on author order makes such great sense.)  In most areas of computer science that I've participated in, a similar spirit prevails.

I encourage graduate students to team up pick out projects to work on together (and have seen this at other places, also -- one of my best experiences as a graduate student was such a project).  It gives them something to do and a sense of ownership outside of working with their advisor.  And, importantly, in reinforces that these other students are their colleagues, and that working together is a great idea that can gain for everyone.  Hopefully, they also learn that working together is more fun and generally more productive than working alone.  When it comes to hiring time, it's nice to see students who have worked on such team projects, because I typically prefer colleagues with a track record of working well with others.     

Sometimes small competitions break out, sure -- multiple groups are working on the same or similar problems.  Often, though, this is a very healthy competition, pushing progress forward in an area over a series of papers.  I remember in the past an argument with another group when we were working on similar problems and an issue of "credit" in the writeup of the various papers came up.  A week later, we were starting collaborations together on new ideas.  That's not exactly the sign of a super-competitive landscape. 

It could be that I've just got a mistaken impression of the field.  Harvard is a wonderfully collaborative place, and personally I've found overall I like working with others more than on my own.  But when I think of cutthroat competition, I don't think of the job I'm in.

To conclude the post, I think what may be going on is people confuse "competition" with "drive".  Most academics are smart, successful people, who are internally driven by a desire to do great work.  To many, that must appear like "competition", but if so, it's internal competition -- you're not out to beat others, but to be our best.  And I think it's very possible academia does have more "Type A" personalities that have this internal drive that is, surely, not always healthy.  It's not clear to me that this academia's fault -- such people would be similarly driven in any career -- but, if it is true, then it suggests we might consider if this is best for our field, and how we might open up the field to a wider set of personalities or how we might make work in this field healthier for this type of personality.  





Wednesday, December 10, 2014

Stress, Continued: Jobs

Continuing from last post, I aim to focus on some of the issues related to stress in academia.  Today's post will be related to stress and the nature of academic employment.

The most stressful issues I can think of in academia relate to finding or keeping your job.  Graduating and finding the first job, especially when the job market is tight, cannot help but lead to stress for most people.  (Indeed, these days, the process seems to be getting worse -- as postdocs become normalized in Computer Science, many people have to find the their "first job" two or more times.)  Similarly, when you come up for tenure, it's obviously very stressful.  Even if you think you deserve tenure, there is uncertainty because the process it outside your control, and the 1-bit decision can have a tremendous impact.  While changing jobs later can also be stressful, and I'll discuss that, these two times seem to be the worst.

As a starting point, however, compared to other professions, the issues related to getting a first job and tenure are not especially unique to academia.  Doctors have residency and specialization after medical school, lawyers wait years to become partner.  Business people have it different, perhaps;  tech startups notwithstanding, there are people who still climb up the corporate ladder.  The tenure situation is, arguably, a more severe end goal than in other professions, but with what seems to be a commensurate reward;  you don't really have to worry about a job after, if you are willing to stay at your institution.  The framework and corresponding stress seem comparable to many other career paths, although there are career paths that avoid such poignant milestones.  In particular, in computer science, many students from top institutions can quickly and readily find work at major companies or startups, and their stress seems to come from having a wealth rather than dearth of choices.

Would there be changes to the system that would help?  I don't think so, and this requires some explanation.  The issue of job stress, to me, seems fundamentally an issue of supply and demand.  In the top 20 computer science departments there are approximately 1000 professor jobs.  Heck, maybe I'm off by a factor of 2, depending on how you count (tenured vs all, maybe top 30 instead of 20).  Positions turn over very slowly.  In short, academia is a niche market, with a small supply of jobs and, generally, heavy demand for them.  This creates a lot of friction in the system.

In years past, we've seen a slowdown in the CS academic job market.  Even small slowdowns create tough situations with the small job supply.  The solution was to introduce more postdocs.  It was a working solution for the time, but with risks and downsides -- a de facto postdoc requirement added into the employment picture? -- that leaves us with as many questions as answers.

Similarly, tenure seems like such a huge deal in large because one cannot readily move to another position easily.  Yes, there is the issue of being rejected, and the corresponding loss of prestige that goes with it, but even ignoring that, the challenge of where to go (in academia) if tenure is not granted looms large.  The problem extends past tenure.  Even very good people can find it hard to move, as the small number of jobs available make for an inflexible, challenging job market.  If I was at Google and wanted to work at Facebook or some other company, such a move should not, generally, be difficult;  people make such switches pretty regularly.  If I walked over to MIT and told them I would like to move there, there are huge barriers, perhaps the most obvious of which is MIT has a wealthy choice of wonderful people it could choose to hire instead, and is careful in choosing who it hires to one of those limited slots.  Indeed, at most other schools, the obvious issue is whether there would even be a senior position available, but MIT computer science always seems to have a position if it wants one.*  The upshot is if later in life a professor wants to switch jobs for any number of personal reasons (dissatisfaction with the current location, divorce or family issues, etc.), it's not always easy or possible to do and stay in academia.   The problem is again related to scarcity in jobs, and solving that problem seems out of reach, involving changes to institutional or societal structures.  (That is, we have to convince our universities we need to hire more CS faculty, and/or we need to convince society to spend more on research/professors/education.)

The job stress issue is the most prominent stress issue I see in academia, and I think it underlies a lot of other stresses.  When people say academics are over-concerned with and spend too much time jockeying for being ranked highly -- a point I have issues with and I'll return to in a later post -- to the extent that's true, I think some of it is inescapable in such a job market.  When you graduate, you'll be compared to other graduates and that will have an effect on what jobs you can get.  When you are up for tenure, you'll be compared to peers, implicitly or explicitly.  If you want to switch jobs post-tenure, how you compare to people at the institution you wish to move to and to others in your field is important.  All of these comparison points become more important in a tight, friction-filled job market.  As much as I'd like that not to be the case or deny it, I think it's better to face the reality of it.

What possibility is there for a solution?  The easiest I can think of is to expand the job market, which I think comes from industry jobs.  As part of this, we have to help make sure industry sees the value and importance of research, and of having its own researchers.  Some people have said that academics look down on "non-academic" employment for students and especially PhDs.  I don't think that's generally true in CS, but to the extent it is, that's setting unrealistic expectations for outcomes for many -- or most -- of our students.  The virtues of jobs in industry are well documented, as are the virtues of academic jobs;  maintaining a culture where both are seen as positive outcomes seems healthiest for stress levels of individuals and for the health of the CS academic community generally.  

Other solutions welcome and more to come.

* Just kidding around with you, MIT.  But seriously, is there any upper bound on you all?

 
    


Tuesday, December 09, 2014

Stress

I've seen lots of news and posts recently on the general theme of stress in academia. Daniel Lemire blogs about Stefan Grimm, a professor who recently committed suicide, and talks about academia as an anxiety machine.  More close to home, we recall that Seth Teller committed suicide earlier this year.  Previous Harvard faculty member and loud blogger Matt Welsh frequently portrays academia as a stressful place (most recently discussing the Fame Trap), and recently Harvard's Radhika Nagpal offered her advice about the scary myths of academic life and how to find happiness.

So here is my question, and I don't think it's a simple one:  is the academic world a stressful place?  The above are an interesting set of anecdotes, but there are certainly contradictory anecdotes. 

For example, as a starting point we might look to the CareerCast list of least stressful jobs, where university professor was judged the least stressful job in 2013 (and nearly the least in 2014), and the reaction to it.  (A nice Forbes article here, and last year's related article here with hundreds of comments.  Inside Higher Ed articles here (2014) and here (2013).)  The big quote this year from Inside Higher Ed:  
Tony Lee, publisher, CareerCast, added via email: "We received a lot of feedback about our ranking of university professor as a low-stress job. But we found that while adjunct and part-time teachers are right that their jobs can be stressful, the stress levels for tenured university professors – which is what we rank – are lower than the majority of other jobs we measure in our report."
My own take on this amusement (before we get to the serious) is that they probably underestimate the stress levels of professors -- even tenured, university profs.  On the other hand, since CareerCast's "most stressful" jobs include military, police, firefighters, and event coordinators, I'm not going to argue against the point that university professors may have it comparatively easy.

Now, more seriously, is life working in academia stressful?  I think it's an important question;  to the extent that academia is more stressful than it should be, it may be damaging both to individuals, and to the overall success of scientific productivity on a larger scale.  On the other hand, perhaps the level of stress in academia is natural for various reasons, including the type of work involved, or the type of people involved, and is not really so bad either in relative or absolute terms.   

I don't know the answer to this question.  I could speak for myself (this is my blog, after all), but that's just more anecdote.  I poked around a little bit but didn't see anything especially informative in the research literature I saw;  I'd be happy to have pointers suggested.  Also, it's a problematic question, for several reasons, as there are all sorts of underlying questions to answer first.

1)  Stressful compared to what?  What are the right comparisons?  I tend to think the comparisons should be with other professional jobs (doctor, lawyer, businessperson -- and, for CS, I guess the appropriate "software engineer" type title), which seem to me to pretty stressful employment options in their own right. 

2)  Stressful at what time in one's career?  I think there are different stressors, and different amounts of stress, at different career points.  Grad student stress is not the same as assistant faculty stress is not the same as tenured faculty stress.

3)  How would one measure "stress", anyway?  CareerCast gives some information on its criteria which leads to university professor being (one of) the least stressful jobs.  Daniel Lemire offers the week-end-freedom test for how free you are (which may be one way of checking stress), but I don't agree with his assumptions. 

4)  And how might one account for various selection effects?  Maybe academics are more generally "Type A personalities".  Perhaps the right questions are not why academics are stressed, but why does academia attract so many Type A personalities, and/or does academia reward Type A personality-types in ways that are detrimental.

5)  Are Computer Science/Engineering in academia different in terms of stress than other academic fields, since these are the areas I care most about?  (Ostensibly, for instance, we get paid higher salaries, and have ample opportunities for outside-university work.  Both of these could affect stress levels.)

I'm sure you can come up with more issues that complicate the basic question.

I may do some more posts on this basic theme;  I do think the topic is important, but there's little hard information.  Seeing as how there's lots of psychologists hanging around universities, I'm surprised that I didn't quickly find one or more recent comparative studies on stress levels, personality types, etc. for academics, but perhaps I just need to look harder.  (If only I had the time!) 

To be clear, my bias is that when I read blog posts from my respected colleagues that portray academic life as some sort of competitive pressure-cooker where everyone is out to increase their ranking and are working 80+ hour weeks without any sort of social life, that's just not the reality I am aware of.  (Including, I think, for the people who actually write these posts, who generally seem to be talking about "other people".) 

To cut off some potential responses, I'm not blind, nor stupid;  there are plenty of stresses in academia, in particular having to do with the employment situation, where the number of jobs is not sufficient for the number of potential job-seekers generated.  And I would be interested in finding useful ways to understanding stress in the field and how to reduce it productively, especially for graduate students.  But to do this, we would need to move beyond individual anecdotes, gain some more data-driven understanding of what's going on, and start determining best practices.  The mythologies about academic life and stress do not seem helpful in understanding where there are problems and how to address them.  





Monday, December 08, 2014

CS 50 Fair (Live)

The madness that is the CS50 Fair at Harvard is going on now.  Hundreds of students demo-ing their final projects.  If you can't be there, you can watch the live stream.  I did a brief stop-by this morning and saw some fun and useful stuff -- one student showed me a simple setup for students to ask questions online in class and have other students up-vote them so the prof can see popular questions.  (There are tools that do this, sure, but it would be nice to have a simple one-off easy-to-use code setup to do this on the fly.  Hey -- student -- if you're reading this, come see me, I want your code. )  Another had a nice interface that gathered information on all the various Harvard events from different websites and pulled them together with some basic search functionality.  I'll have to talk with David Malan about how we're doing transitioning some of these useful prototypes into tools for us or around campus.  Still amazing what energetic CS students can do after an introductory class.  And hopefully I'll go back and see another shift of projects later this afternoon.




Tuesday, December 02, 2014

Yet More Good News, Harry Lewis Edition

As mentioned previously, our current Dean for the School of Engineering and Applied Sciences, Cherry Murray, will be stepping down at the end of the year.

The good news out today is that Professor Harry Lewis will be stepping up to serve as the Interim Dean.  To those in the outside world, you may know Harry for his blog, his books (CS books like Elements of the Theory of Computation (2nd Edition) and Data Structures and Their Algorithms, or more popular books like Excellence Without a Soul: Does Liberal Education Have a Future?, Blown to Bits: Your Life, Liberty, and Happiness After the Digital Explosion, and Baseball As A Second Language), or his time as the Dean of Harvard College. 

For those of us who have been students (or professors, or both) at Harvard, we know Harry as a mentor, a guide, a wise man to come to for advice, and an inspiration.  (Both Salil and I frequently mention Harry Lewis and CS 121 at Harvard as reasons why we ended up as computer scientists.)

We are very fortunate that Harry stepped up to take on this important role, and his work will let SEAS -- and CS at Harvard -- continue to flourish.  As Harry says in the Crimson article above, "“The future of engineering at Harvard has never looked brighter than it does today."  And as FAS Dean Michael Smith says, “He brings to the deanship a thorough understanding of SEAS culture and academic life, exceptional insight into the undergraduate experience as a former dean of Harvard College, and the intellectual and administrative agility needed to effectively guide SEAS during this interim period.”

Monday, November 24, 2014

More Good News, Eddie Kohler Edition

Continuing the good news at Harvard, word has been given that Eddie Kohler has officially been offered tenure.  So we hope/expect/are excited about keeping Eddie here for decades to come.  

Eddie's contributions to computer science are well known -- he recently won the SIGOPS Mark Weiser award --  but perhaps less well known and appreciated are his contributions here at Harvard.  I can't imagine how we were surviving before he got here; he enriches the place constantly through teaching, tools, insights, questions, and countless other ways.  It's great news for us here at Harvard. 

Congratulations to Eddie. 

Thursday, November 13, 2014

The News Is Out

Harvard Computer Science was given a sizable gift from Steve Ballmer;  we're excited at the opportunity to grow this gift will provide us over the next several years.  For more news, see the Crimson article, the Boston Globe article, the New York Times blog writeup, etc. 

Monday, November 10, 2014

Veterans Day

Veterans Day is a federal holiday, but, at Harvard classes go on.  This year, the holiday is observed for staff, but we'll be teaching.

I can't easily find out when the practice of having classes on Veterans Day became standard.  I found this Crimson article from 2004, but it makes it sound like holding classes (it seems to just mention sections, not actual lectures) was not official policy.  I also am not sure when this dichotomy of the day being a staff holiday, but classes are held, started.  I'm a bit embarrassed to say I haven't noticed this scheduling issue until this year, so perhaps the changes are new.

I understand that, from a teaching and scheduling standpoint, it's a bit annoying with the holidays messing up the teaching schedule.  It's also annoying when your children are out of school for the day for a federal holiday while you have to teach.  Also, it seems a bit odd, if not somewhat unpatriotic, "cancelling" Veterans Day for the students (and faculty), although I know that most people generally do not engage in veteran-specific activities for the holiday.  (That's true of our family, and my wife is a veteran.)    

I suppose I might not really have thought too much about it, except that somehow I noticed that Columbus Day is a university holiday at Harvard, and, because I'm sure most readers do not get enough John Oliver in their diet, here what his show has to say about that.  

Happy Veterans Day.


Saturday, November 01, 2014

Weekend Links (November 1 Edition)

A variety of links on related news.

Ebola #1 (local):  The HackEbola with Data website gives detail for the upcoming Harvard Hacking event in conjunction with Statistics without Borders.  If doing some data mining for a cause is your thing, please check it out.  (No, you don't have to be associated with Harvard to participate...)

Ebola #2:  New Yorker magazine has an article out on "The Ebola Wars" that among other things, describes the work done by Harvard colleague Pardis Sabeti and her lab

Ranking again (no, not again!):  This one is tells the story of King Abdulaziz University and the US News and World Report Rankings.  (I note, as follow up from a few years ago, that I remain disturbed that nobody is offering to pay me large sums just to have me affiliated with them for the purposes of raising their citation count.)

Crimson #1:  If you're into this sort of thing, the umpteenth "Stanford vs. Harvard" article.  (Yawn.)

Crimson #2:  Anyone know of a good Dean who might be available?  

Yale:  My take on how you grow a department:  keep telling people you are going to/need to grow the department, and why.  Lather, rinse, repeat.  I hope this article is just the start. 

Tuesday, October 28, 2014

Rankings Everywhere

Like many (theoretical) computer scientists, I spent far, far too much time this weekend thinking about rankings.  Well, really, it was more that I was being far, far too amused by the various posts and comments on the novel ranking site "Ranking of CS Departments based on the Number of Papers in Theoretical Computer Science",  which led to posts at the Computational Complexity blog and in theory, the first of which shut down the comments rather quickly, and the second of which is at this time nearing the amazing 100 comment line.  (Luca's post deserves it -- it was the funniest thing I've seen all week.  Kudos.) 

Now, to fan these flames to further untold heights, I was informed that new US News and World Report Computer Science Rankings are out.  Needless to say, I disagree with them.  How exactly can Harvard be second to MIT?  Ridiculous.

(For those who are concerned about such things, the methodology is also given, though I haven't checked to see what parts of it are reproducible.  And what is that "Web Of Science" thing?  Is it as good as the citation tools in Google Scholar?)

For the ultimate in ranking tongue-in-cheek-iness, and for those who haven't seen it, I also recommend the NRC Deranker, which will (eventually) provide you the ranking you personally agree most with. 


Monday, October 27, 2014

Welcome, Attorneys!

You might wonder why I am welcoming attorneys to the blog.  Well, I happen to know that some attorneys read my blog.  I know this because recently, as an expert witness, while I was testifying in court, the attorney on the other side during my cross-examination specifically asked me about my blog.  Based on their questions, I am reasonably sure they had downloaded all the posts.  I imagine some poor associate attorney low on the totem pole had to read through it all to see if they could find anything embarrassing they could use in the context of the case. 

I don't think it was successful.  But who knows.  At one point, the attorney cross-examining me asked if one of the reasons I liked to do consulting work was because it pays well.  I answered, "I would say that's one of the reasons."  That wasn't a hard answer to give;  it's an honest answer.  But afterward I recalled that, years ago, I had talked about consulting on my blog.  I looked it up, and sure enough, here is what I wrote years ago:
I enjoy consulting, for several reasons:
  1. It fulfills a need to be more directly relevant to the real world.
  2. It gives me a chance to work with different sets of people.
  3. It allows me to learn new things and exercise skills that I don't necessarily use as a professor.
  4. It pays well.
Based on my experience as a witness, I am of very high certainty that the attorney had this post in front of him when he asked the question.  Of course, when he asked the question, he failed to note the other reasons I had given for consulting, or provide me the blog post for context.  But my assumption is that they were simply looking for a "gotcha" moment.  Perhaps the question and my response made me look bad to the jury.  (After subsequent clarifying questions on the issue from the client's attorney, I would like to believe it had no effect.)  I imagine that they were hoping that I would say that the pay wasn't important, in which case I am sure they would have readily put up the post from my blog to show the jury to try to discredit me.   

I talk very little about my legal consulting on the blog or publicly.  The main reason, obviously, is out of respect for and professional obligation to my clients.  But I find I don't blog about it even in very general terms, where confidentiality or other similar issues would not come into play.  And one good reason not to is that an attorney could use anything I write on the subject to try to impeach my credibility as a witness, even if it is out of context. 

Of course, as this example shows, anything I post online can potentially later be used against me in a case or in court.  Or in other situations.  These issues regarding our permanent digital identity, and how modern technology allows our past to easily be brought up and used against us, are not new;  I'm not suggesting I'm offering novel insight here.  Really, I'm just relaying an instance where the issue of my digital identity, in the context of this blog, hit me head on.  Certainly when I wrote that post years ago about consulting I did not think it would come up later in this way.  Mostly it makes me wonder how my children, who are busy exploring their own identities (digital and otherwise), will manage in a world where most of their history will be available online, for people to use or misuse in ways that I could not imagine.     

Friday, October 24, 2014

Harvard Junior Faculty Job Search

Harvard Computer Science is doing a junior faculty search this year.  We pretty much have been doing one every year, but it's nice to be able to make an announcement and point people to the ad.

So here is the ad, telling you what you need so send in if you're applying. 

Also as usual, we're looking in all areas, because we're always interested in great people.  However, in the interest of full disclosure, the focus this year is described as
This is a broad faculty search and we welcome outstanding applicants in all areas of computer science, including applicants whose research and interests connect to such areas as engineering, health and medicine, or the social sciences. Of particular interest are candidates with a focus on systems and data science, including operating systems, networking, software engineering, data storage and management, and computational science.
We'll look forward to your application.

Wednesday, October 22, 2014

Cuckoo Filters

An upcoming paper appearing at CoNext will be of interest to any Bloom Filter user or aficionado:

Cuckoo Filter:  Practically Better than Bloom
(Bin Fan, David Andersen, Michael Kaminsky, Michael Mitzenmacher)

Let me describe a cuckoo filter and some of what's in the paper for you.  If you want to avoid a technical discussion, all you need to know is that for reasonably large sized sets, for the same false positive rate as a corresponding Bloom filter, cuckoo filters use less space than Bloom filters, are faster on lookups (but slower on insertions/to construct), and amazingly also allow deletions of keys (which Bloom filters cannot do).  If you want to look at code, there's even a github repository for you with code for cuckoo filters. 

Note:  I'll assume you probably know what a Bloom filter is in the discussion that follows.  As a reminder, it's a hash-based data structure for set membership that can give false positives;  by allowing false positives, you can save space in your representation of the set.  I'll assume you probably know what cuckoo hashing is.  As a reminder, in cuckoo hashing, you have buckets that can contain a fixed number of keys, and each key gets some number d of buckets (obtained by getting d values from hashing the key) where it can be placed.  When you insert a new key, if all of its buckets are full, you can move keys out of that bucket to one of its other buckets to make room.  Since apparently there are drinking games that exist based on taking a drink whenever I use the word "Bloom filter" or "cuckoo hashing" in a talk, I imagine most readers of the blog know about these things.  (Or, always, Wikipedia.)

The framework used in the paper comes from the following idea.  One way of implementing Bloom filter functionality when given a static set of items is to use a perfect hash function;  that is, as hash function that maps the n elements of your set into an array of size n in a 1-1 fashion.  Then, you use another hash function to store a fingerprint of the element instead of the element itself in the array location.  To do a set membership test on an element z, you hash z once to find its location in the array, and hash z again to find its fingerprint, and check against the fingerprint in the array location.  For elements not in the set, you get a false positive rate of 2^{# of fingerprint bits}.

The problem is that the perfect hashing framework doesn't allow you to insert new items naturally.  Or do deletions (but standard Bloom filters don't allow deletions also -- you need a counting Bloom filter for that).  So a natural thing to do is to think of using some other sort of hash table, besides a perfect hash table, and see what happens.  This idea was explored way back in for instance some papers I worked on in 2006 (here and here) using "d-left" hash tables, which are based on a multiple choice hashing scheme.

If you're going to use multiple choice hashing schemes, though, you should think about using cuckoo hashing.  The ability to move keys around means you should get better space utilization;  for example, even with 2 choices, if your buckets can hold 4 items, cuckoo hashing can get you about 95% space utilization.  The problem with cuckoo hashing in this setting is that, for a Bloom filter, you want to just keep fingerprints of keys, not the keys themselves.  So, when you want to move the key, how do you figure out where to move it to -- you no longer have the key to hash?

The approach used in the paper is a trick called partial-key cuckoo hashing.  To get two buckets for our key, we do a hash to get the first bucket (idealized to be uniformly at random), but to get the second bucket, we do not hash the key, but just the fingerprint of the key;  we XOR that value with the value for the first bucket to get the second.  The upside of this approach is given a bucket and a fingerprint in the bucket we can find the other bucket corresponding to the fingerprint.  (XORing the hash of the fingerprint works if the fingerprint is in the first bucket or the second, as is easily checked.)  The downside is if our fingerprint has F bits, that means our second bucket is really only chosen from 2^F possibilities;  it is not uniform conditioned on the first.

Because of this limitation, one interesting aspect of the paper is that we show, in theory, this approach just will not work.  Specifically, there is an Omega(log n) lower bound on the size of the fingerprint that must be stored; this means the cuckoo filter is super-linear in space, unlike standard Bloom filters which are linear in space.  This lower bound is easy to derive.  Suppose that you have buckets that can hold b keys.  If 2b+1 keys have the same pair of buckets, then there is trouble;  you can't fit 2b+1 keys into 2b bucket slots.  So the question is whether there are collections of 2b+1 keys that have the same pair of buckets.  With standard cuckoo hashing, this would be a low probability event.  With partial-key cuckoo hashing, the probability depends on the size of the fingerprint, since this determines how many possible second choices of a bucket there are for each key (given their first random choice of a bucket).  If you have too few choices, you will get too many collisions, or too many keys in a pair of buckets.  Some straightforward calculations yield the Omega(log n) lower bound on the size of the fingerprint.

So now that we've proven theoretically that this won't work, how could it work in practice?  If you go back to those calculations, you find that the lower bound has some nice constant factors.  The bounds you get on the fingerprint size are actually (roughly speaking) log n/(2b).  That constant factor in the denominator, even when b=4, covers a large range of n.  In practice, fingerprints can be quite a reasonable size, and still avoid the potential bottleneck of too large a group of keys all hashing to the same small number of buckets. 

In any case, this may be the first paper I've written that contains a proof that the described data structure fails in theory but works in practice.  (The other way around I generally find much easier.) 

Big open theoretical question:  try to analyze why partial-key cuckoo hashing gets roughly the same load threshold as regular cuckoo hashing.  (Anyone want to work on that?  Let me know...)

This post is long enough;  there's a lot more in the paper, including comparisons with other schemes, bit-saving tricks, and lots of performance results.  Have a look.  

Postscript/Tangent:  In a recent review (for a different paper, with a different theme) one reviewer made the following comment/criticism:  "Despite years of research, no one has found a variant of cuckoo hashing that can outperform simpler methods in realistic situations."  To which I say, hunh?  Yes, I understand linear probing is wonderful in practice due to cache effects and block sizes and in many and perhaps even most situations can yield better performance than cuckoo hash tables.  But certainly not everywhere.  Cuckoo hash tables certainly seem at least potentially better for many hardware settings, and for at least some important applications -- such as this one -- cuckoo hashing seems to be a clear win.  (The issue here is the number of fingerprints you have to deal with on a lookup affects the false positive probability -- so even if linear probing lookups are "faster" due to cache effects and block sizes, if you try to get this sort of space utilization you get larger contiguous blocks which increase your false positives.)

My co-author David Andersen also violently disagrees with this review comment, and points to these recent papers at NSDI and Eurosys.  

Perhaps I'm missing something, but my thought was this was a sign of a bad review (both in content, and in its overall rudeness).  I felt obliged to try to correct the record (and I can only hope the reviewer sees this post).     
 

Friday, October 03, 2014

Andreessen-Horowitz Again

Andreessen Horowitz had a second Academic Roundtable, gathering together academics, VCs, and people in startups to talk for a few days about a variety of topics.  I wasn't sure why I was invited the first time (which I wrote about last year), and am even less clear on why I was invited back, but I again had a very good time. 

Before a few words about the talks, the high order point:  I'm told that most/all of the talks will be put online over the coming weeks.  So if you're interested, you can experience them yourself.  I think it's great that they're doing that this year;  if you like the talks, you should tell them, so they keep doing it.  (If they don't have comments there, you can always comment here.)  Sadly, they don't seem to be up yet, but I'll post the links when they become available. 

Highlights would include a discussion of Bitcoin -- interesting to hear what Ed Felten, well-known Princeton professor and now ex-Chief Technologist of the FTC, thinks about the Bitcoin economy.  Dawn Song of Berkeley gave a general talk on security issues of the present and future, while Dan Boneh of Stanford gave a talk on the power of program obfuscation.  Raj Rajikumar of CMU gave some history and some peeks into the future of driverless cars -- it's not just Google, you know.  Tuomas Sandholm of CMU talked about his take on the starting of startups while still being an academic (based on now-multiple experiences), and Josh Bloom of UC Berkeley (and wise.io) described the differences between writing papers about machine learning and building products using machine learning.

Of course, some heated discussion about the variety of issues that arise between academic work and transfer of technology to startups inevitably ensued.  (Maybe that's why I get invited.)  The key idea repeated by many (on both sides of the fence) in various forms was that businesses (and in particular startups) are very busy going down their road of development and product, and they may see many things out the sides of that road that are very interesting, but don't have time to explore off the road.  Academics are interested in things way off the road, often thinking of issues much further out in time-scale.  And (at least in my opinion) the role academics play is a good thing;  there (obviously) remains a lot of ways the two worlds can interact and cooperate. 

Thursday, September 18, 2014

On Academia vs. Industry.... MSR SVC Closing

I'm one of those professor types that ends up defending the joys of life as an academic versus a career in industry -- some of you who read this blog have probably seen me comment at Matt Welsh's blog or Daniel Lemire's blog/Google+ chain.  And to be clear I'm not anti-industry, I just want to make sure there's a fair discussion.

In that vein, I'm sad to hear that Microsoft Research Silicon Valley will be closing, as described in this article.  I know many people at MSRSV, and have visited there often.  It's a loss to the community to have it close.  I have sympathy for those who work there -- this must be stressful -- but this sympathy is happily diminished because I know the people there are so talented, they will quickly move on to other employment.  (When Digital Systems Research Center closed long ago, a number of the people I worked with in the lab just moved on to some small company that seemed to be just starting to really get going, called Google.  I wonder how it worked out for them.) 

After a moment of silence for the lab, I do feel it necessary to point out that this is one of the "issues" in the academia vs industry life choice.  The life cycle for companies moves rapidly.  That's not a bad thing, but it's a thing.  Disruptions like this are a non-trivial risk in industry, much less so in academia.  (Just look at the history of research labs.)  Again, I'm sure it will work out for the people at MSRSV, but any life shift like this -- even if it ends positively -- is stressful.  Without wanting to overstate the consequences, it's worth pointing out.

Monday, September 15, 2014

Simultaneous Enrollment

The Crimson has an op-ed on simultaneous enrollment, that I agree with.

Harvard does not like simultaneous enrollment, which means a student taking two classes that meet at the same time -- any time overlap counts (whether the whole class or half an hour once a week).  If you want to take a class via simultaneous enrollment, you have to petition the Administrative Board, and your professor is supposed to provide direct hour-per-hour instruction for the class you can't intend.  As a previous Crimson article states:
The Faculty Handbook requires that “direct and personal compensatory instruction” for simultaneous enrollment, but only recently has the Ad Board refused to recognize videotaped lectures as a stand-in for class time.
The article references that for the past several years the Ad Board has accepted recorded lectures, under some additional conditions, as a suitable proxy for the direct and personal compensatory instruction.  This apparently represented a change from their past position, and this last year, while I was on sabbatical, some Standing Committee on Education Policy decided to push back and say no more recorded substitutions. 

(An aside:  I don't know why they used the dated term "videotaped".  My lectures have been recorded and made available online;  these days, I'm not sure any "videotape" is actually involved in the process.)  

I believe I had a great deal to do with the Ad Board accepting recorded lectures the last several years.  My undergraduate class, CS 124, was recorded with very high quality production for the Harvard Extension School, and I made this video available to the students.  Some years ago, students starting approaching me wanting to do simultaneous enrollment for CS 124, and I thought it was fine, because of the availability of class lectures.  (That was, for instance, how the Extension students were taking the class.)  In particular, every year some number of students seemed to want to take CS 124 and a conflicting math class that overlapped, which made it very difficult for a small number of students who wanted to pursue both CS theory and mathematics.  (Both classes were "gateways" to more advanced classes;  at least for me, changing my class time would have just introduced other more challenging time conflicts;  neither class appeared poised to change times.)  But other class overlaps came up as well. 

I initially faced some opposition from the Ad Board when I supported the students' petitions for simultaneous enrollment because I suggested the recorded lectures were a suitable proxy.  They objected, and I objected to their objection.  After some back and forth, we found language which we managed to agree met the language of the faculty rules, with the outcome being the students could, in the end, rely on the recordings of class lectures.  I admit, I believe it helped that I had served on the Ad Board, and had a good working relationship with them, so they were willing to work with me to come to a mutually acceptable solution.  The framework we established was later used for CS 50 and other computer science classes, because I passed on my successful approach with the Ad Board to others.  (David Malan asked me for assistance on the issue, for example.)  There was, however, always some tension, with the Ad Board continuously questioning whether using class recordings to justify simultaneous enrollment was suitable.  Meanwhile, the number of such petitions kept growing. 

Apparently, while I was sabbaticalling, various powers that be have acted to tighten the rules for simultaneous enrollment, and disallow my previous framework.  Interestingly, CS 50 -- the class with the most requests for simultaneous enrollment -- seems to have acquired a blanket exception, which I am happy for, but still leaves the rest of us facing unhappy students who often have to deal with ugly situations.

What I find frustrating is that, to my knowledge, this change is not based on any data about student performance, or any understanding of student behavior.  It's based on a presumption that students are supposed to sit in classes to learn.  David Malan has some great data showing this presumption doesn't seem to have a basis in reality.  Students in CS50 taking the class under simultaneous enrollment don't do any worse than other students.  In fact, if I recall the data right, they do better than average;  this would not be surprising, as they are a self-selected group with apparently high interest in the subject.  David also has great data showing how lecture attendance steadily drops over the semester.  If a substantial fraction of the students are missing class and watching the video anyway, what's the point of blocking simultaneous enrollment?  Maybe because of all this data at David's command CS 50 got its exemption, but I'm sure if we go back we'd find similar findings for CS 124 and other classes.   

Professors have always had the right to refuse simultaneous enrollment petitions for their class, so if the professor expects student attendance, there is no problem.  I don't think it's a radical suggestion to say that for recorded classes, professors and students should have the ability to jointly decide if simultaneous enrollment is a suitable solution to the small number of inevitable class scheduling conflicts.  I'm not sure what combination of faculty and administrative busybodies decided they had to fix what I see as a non-problem, but I'm sure their rule-making (or in this case rule-interpreting), while it only affect a small number of students, will be a significant net negative.  Kudos to the Crimson for pointing out that this is a poor decision, and I only wish I was optimistic that it could be quickly reversed.  

Thursday, September 11, 2014

Biggering

Preliminary course enrollment numbers are in.  Our new theory course, CS 125 (rapid introduction to theory), has 32 undergrads and 5 others, for an enrollment of 37.  Salil and I had predicted and desired in the 20-40 range for the first year of this course, so we're on target.  We hadn't thought about CS 125 being a class for grad students, but it actually makes sense.  Entering CS grad students who haven't had background in theory, or grad students from other related areas (statistics, mathematics, etc.) who want a fast-paced introduction to the basics of CS theory, would both fit in well for this class.  Maybe next year we'll even advertise it for grad students as well.

Our mainstream required-for-majors theory course, CS 121 (the "here's Turing machines and undecidability and NP-completeness" course), has an enrollment of just over 150, the largest it's ever been.  Given that CS 125 was supposed to draw some students from 121, that's even more amazing.  

And the standard CS introductory course, CS 50, has over 800 undergraduates (and over 850 total) signed up, making it now the largest class at Harvard.  This was so noteworthy that the Crimson had an article this morning about it.  As usual, you can find a great quote from Harry Lewis: 
“Harvard students are smart people,” said Harry R. Lewis ’68, former dean of the College and current director of undergraduate studies for Computer Science. “They have figured out that in pretty much every area of study, computational methods and computational thinking are going to be important to the future.”
So the growth continues, at least for another year.

Wednesday, September 10, 2014

You Can Rename My Family for.....

An interesting article in today's Crimson about the background to the announcement that Harvard's School of Public Health will be re-named the Harvard T. H. Chan School of Public Health, for a $350 million donation.  Feel free to comment on your feelings regarding the issue.  (The Kennedy School of Government notwithstanding, Harvard schools have not been "named" via donations historically, although certainly buildings and such have.)  I'm not publicly taking sides -- indeed, it seems like a wonderful debate challenge to figure out arguments on both sides of the issue as to whether naming schools in this way is a good or bad idea.  Even if you think it's a good idea, you might wonder about the price tag...

In particular, I can't help but wonder what the naming rights of the School of Engineering and Applied Sciences could go for?  In the course of our capital campaign, will I have the chance to find out?  Maybe we could get a bidding war going?  How much would SEAS have to get for me to feel good about having "Harvard's XXXX School of Engineering and Applied Sciences" on my letterhead for some appropriate name XXXX? 

At a baser level, Mitzenmacher has always been a problematic name.  Too long.  (On so many standardized forms, I end up as Michae Mitzenmache....)  Maybe I should offer myself up for naming rights.  Sadly, I don't think I'd get $350 million.  

Sunday, September 07, 2014

Teaching Sorting

For many years now, while teaching the introductory Algorithms and Data Structures, I have told students that we won't be starting with (or really covering) sorting and searching, because it's boring.  While that description is an exaggeration (the students see mergesort [an example of divide and conquer] and heapsort [heaps are important as an implementation of priority queues for Dijkstra's algorithm]), I've never thought that staring this class with a long unit on sorting/searching algorithms, which most textbooks like to start with, is the best use of time.

So it was a bit odd to find myself for our new "honors course", CS 125, spending the first day-and-a-half or so on sorting algorithms.  The key was that there was the right motivation for it.  We'll be tackling both algorithms/data structures and complexity, and the theme of how to model computation will play a big role throughout the course.  When we thought about what was the right way to introduce the class, and in particular this theme, sorting seemed like the right choice.

After very quickly refreshing the students on some basic comparison sorts (bubblesort and mergesort), we present the standard lower bound argument for comparison-based sorts.  But then we show how this lower bound can be broken, by assumptions about the data that allow one to go beyond comparisons to other operations.  Specifically, we raced through counting sort, bucket sort (for random data), and sorting via van Emde Boas trees.    (The students in particular seemed to be talking through the details of van Emde Boas trees at the class break, which I'd like to think because they were new and interesting, but possibly was due to the fact that it was the first time I'd ever presented them, and maybe my delivery for that topic needs some tuning.) 

Lo and behold, I find myself finding sorting interesting!  We didn't even get to more advanced interesting possibilities, like data-oblivious sorting or "bleeding edge" parallel sorting.  I feel I need to apologize to sorting, for mistakenly suggesting that it's a boring topic.  I still wouldn't want to start an algorithms/data structures class with multiple weeks of sorting and searching algorithms, but sorting is a natural way to introduce some key concepts in algorithms, and there's fun to be had in the large variety of approaches to sorting.

Thursday, August 28, 2014

Shout-out to Sabeti Lab

A shout-out today to my friend and colleague Pardis Sabeti (and her lab) for their Science article on the Ebola virus that appeared earlier today.  Pardis and her group have been studying the genetics of viral diseases, and in particular the Lassa virus.  So they were there and ready when the recent Ebola virus began and went to work.  They sequenced 99 Ebola virus genomes from 78 patients, and have analyzed the resulting information to gain insight into how the disease is spreading and mutating. They have released the genetic sequences to the public so that the information is available to groups who could use the sequencing to help find and test cures.  This is both very important work, and (sadly) very serious.  As reported, 5 co-authors of the study (health care workers, a lab technician, and the director of the national Lassa fever program in Sierra Leone) have died of Ebola before today's publication.  

Numerous news articles appeared today discussing this work, too many for me to link to.  But the SabetiLab page here contains a great deal of information about her research and various projects.  Her work in biology, which makes powerful use of statistics and computation, should serve as an inspiration for future scientists, and for people who want to use algorithmic and mathematical methods in ways to benefit the world. 

Wednesday, August 27, 2014

Update on SOCG and ACM

I am happy to have an update on the SOCG/STOC colocation issue that arose last week.  Or, better said, Jeff Erickson has an update, the short summary of which is that it looks like there has now been some very useful clarification.  The concerns of the ACM apparently are limited to direct financial support, and the conferences can (formally) co-locate.  I encourage you to read the note on Jeff's blog from Paul, and let me echo Jeff's statement here: 

"Needless to say, this is fantastic news!  I want to publicly thank Paul and the ACM leadership for quickly clearing up this unfortunate misunderstanding."

So say we all.

Tuesday, August 26, 2014

New Article on arxiv on Equitability and MIC

We recently put on arxiv a new draft on "Theoretical Foundations of Equitability and the Maximal Information Coefficient".  This is some follow-on work to a paper that appeared in Science a couple of years ago, where we introduced the idea of equitability.  Essentially, in that Science paper (link to page where you can access the paper), we wanted a statistic that would give back, for samples from a noisy functional relationship, a score corresponding to the amount of noise (or, in that case, to the R^2 of the noisy data relative to the relevant noiseless function), regardless of the relationship type.  The idea was that this would be useful in data exploration settings, where we might have a large number of possible relationship pairs and in particular a number of non-trivially correlated relationships, and we'd want to score them, in some fair way across the possible types of relationships (linear, parabolic, sinusoidal, etc.), so that we could choose the most promising to look at.  We also wanted the statistic to do reasonable things for non-functional relationships.  And, finally, we wanted a pony.  (But we couldn't find a way to put that in the paper.)  The maximal information coefficient (MIC), which we built on top of mutual information, was our proposed statistic.

The paper has gotten some interest.  One thing that we heard was that people wanted a richer theoretical framework for these ideas.  So now we're finally delivering one.  It took a while, because the students involved -- Yakir Reshef and David Reshef -- were off doing crazy, wacky young-people things like going to medical school, making it hard to get cycles for the project.  On the other hand, the time did some good, allowing us to explore to determine the formulation we wanted. The result is, I hope, an interesting mix of ideas from statistics and computer science.  We're eager for feedback as we hope to formally submit somewhere soon. 

In a couple of weeks we should have another paper out on the same topic that is more empirical.  Naturally, when working through the theory, we came up with better algorithms for computing MIC, and it made sense to separate those results (and some others) into another paper.

Saturday, August 23, 2014

Is the ACM "Retaliating" Against SOCG?

Friday afternoon Jeff Erickson posted at Making SOCG blog some "bad news".  Some background:  very recently, the Symposium on Computational Geometry, or SoCG, decided to leave the ACM, for various reasons.  There had been plans in the works for SoCG to be co-located with the Symposium on the Theory of Computing, or STOC, one of the flagship general theory conferences, in 2016.  STOC is an ACM conference.  Reportedly, Paul Beame, chair of SIGACT (the ACM theory special interest group) sent a note that included the following (emphasis added by Jeff):
With SoCG leaving ACM we have been told that SIGACT and ACM conferences cannot have any formal arrangements at all with the new conference or do anything official that might support it. (This decision was made at the very top of ACM and not by the staff.) This rules out any joint sessions... It also means that SIGACT will have to end our participation in this formal coordinating group.
While I await more information and do not want to rush to judgment, if this description is true, I find it unacceptable behavior on the part of the ACM.  Their job is to promote computer science.  (The ACM home page begins with:  "ACM, the world’s largest educational and scientific computing society, delivers resources that advance computing as a science and a profession.")  Their job is not to try to monopolize and control the computer science conference market.

I encourage everyone to read Jeff's post (here's the link again).  Obviously, this is an issue for Computational Geometry, and an issue for Theory.  But I would say it's an issue for all ACM members, and in particular those that publish research at ACM conferences.  It would be nice if, should it become necessary, members from other areas in computer science voice their displeasure with ACM's actions in this case.


Thursday, August 21, 2014

Hashing Summer School

Back in July I took part in the Hashing Summer School in Copenhagen.  This was nominally set up by me, Rasmus Pagh, and Mikkel Thorup, though Mikkel was really the host organizer that put it all together.

The course materials are all online here.  One thing that was a bit different is that it wasn't just lectures -- we really did make more of a "summer school" by putting together a lot of (optional) exercises, and leaving time for people to work through some of them in teams.  I am hoping the result is a really nice resource.  There are lectures with the video online, and also the slides and exercises.  Students could go through whatever parts they like on their own, or people might find the material useful in preparing their own lectures when teaching graduate-level topics in hashing.  


Tuesday, August 19, 2014

Reviewing Scales

I'm just about finished reviewing for CoNEXT (Conference on Emerging Networking Experiments and Technologies), and am starting reviewing for ITCS (Innovations in Theoretical Computer Science).  One notable variation in the process is the choice of the score scale.  For CoNEXT, the program chairs chose a 2-value scale: accept or reject.  For ITCS, the program chair chose a 9-point scale.  Scoring from 1-9 or 1-10 is not uncommon for theory conferences.

I dislike both approaches, but, in the end, believe that it makes minimal difference, so who am I to complain?

The accept-or-reject choice is a bit too stark.  It hides whether you generously thought this paper should possibly get in if there's room, or whether you really are a champion for the paper.  A not-too-unusual situation is a paper gets (at least initially) a majority of accept votes -- but nobody really likes the paper, or has confronted its various flaws. (Or, of course, something similar the other way around, although I believe the first case is more common, as it feels better to accept a close call than to reject one.)  Fortunately, I think the chairs have been doing an excellent job (at least on the papers I reviewed) encouraging discussion on such papers as needed to get us to the right place.  (Apparently, the chairs aren't just looking at the scores, but reading the reviews!)  As long as there's actual discussion, I think the problems of the 2-score solution can be mitigated.

The 9 point scale is a bit too diffuse.  This is pretty clear.  On the description of score semantics we were given, I see:

"1-3 : Strong rejects".

I'm not sure why we need 3 different numbers to represent a strong reject (strong reject, really strong reject, really really strong reject), but there you have it.  The boundaries between "weak reject", "a borderline case" and "weak accept" (scores 4-6) also seem vague, and could easily lead to different people using different interpretations.  Still, we'll see how it goes.  As long as there's good discussion, I think it will all work out here as well.

I prefer the Goldilocks scale of 5 values.  I further think "non-linear" scoring is more informative:  something like top 5%, top 10%, top 25%, top 50%, bottom 50%, but even scores corresponding to strong accept/weak accept/neutral/weak reject/strong reject seem more useful when trying to make decisions.

Finally, as I have to say whenever I'm reviewing, HotCRP is still the best conference management software (at least for me as a reviewer).

Monday, August 18, 2014

Back to Work

Harvard classes start up in a few weeks, and officially, my sabbatical is over.  I'm back in my office, trying to get back into a Harvard routine.

I notice that I've been very light in posting over my sabbatical.  After my term as chair, I was enjoying being in the background, hidden away a bit.  I'm not sure if I'll get back into blogging -- it seems already to be a technology of the past -- but I figure I'll start again and see what happens.

So some short notes.  On my to-do list is to go cover to cover through Ryan O'Donell's new book Analysis of Boolean Functions; Cambridge University Press was nice enough to send me a free copy, which they do with books from time to time.  For those who have been following he's been releasing the book in chapters online, and you already know it's good.  He's made the book is available online also, but it's nice to have a copy for my bookshelf.  It's a beautiful book, both in content and in how it's all put together.  My one thought (so far) as I've started my way through is that it stylistically, to me, reads like a "math book" more than a "CS book", whatever that means.  That's not meant to be a complaint, just an observation.

Boaz Barak accidentally made me laugh on his Updates from ICM 2014 post, well worth reading, when he writes:
"Candes’s talk was an amazing exposition of the power and importance of algorithms. He showed how efficient algorithms can actually make the difference in treating kids with cancer!....

Hearing Candes’s talk I couldn’t help thinking that some of those advances could perhaps have been made sooner if the TCS community had closer ties to the applied math community, and realized the relevance of concepts such as property testing and tool such as the Geomans-Williamson to these kind of questions. Such missed opportunities are unfortunate for our community and (given the applications) also to society at large, which is another reason you should always try to go to talks in other areas."

I think the larger issue the slow but (over long periods) not really subtle shift of the TCS community away from algorithmic work and practical applications.  I'm all for going to talks in other areas, but I think the issue is a larger scale problem.

I'm working on a new class this semester, which if I write I'm sure I'll write more about, but one thing I'd forgotten is how hard and time-consuming it is to construct a lecture.  Maybe some of it is a function of me getting slower, but going through all the possible pieces, picking what you think are the right ones, making sure you've got all the details, and then (for me) writing a "script" of what you plan to go over -- it takes time.  (Good thing I'll likely be on this new class for a few years.)

Plenty more to talk about -- reviewing, some new/old papers, some summer travel.  So we'll see if I can get back into a blogging state of mind.

Friday, June 20, 2014

See You in Prague

For those of you going to SPAA this coming week, I'll see you there.  I'll be giving the last two talks at the conference, to what I expect (based on the timing) will be a nearly empty room.  That just means there will be no pressure.

If you want to hear more about the papers, you can go to Abstract Talk, where I talk about the papers.  Here is the link for the paper Balanced Allocations and Double Hashing, and the link for the paper Parallel Peeling Algorithms.  I haven't done podcasts for Abstract Talk before, so be forgiving if you go to listen.  It seems like a cool idea;  what do people think of it in practice? 

For those who expect to be sightseeing in Prague during the final session (or who just aren't going to SPAA), here's the brief overview. 

For Balanced Allocations and Double Hashing:
In the well-known balanced allocations paradigm, balls are hashed sequentially into bins, where each ball gets d random choices from the hash functions, and is then placed in the least loaded.  With double hashing, we replace the d random choices with d choices of the form a, a+b, a+2b, a+3b,... a+(d-1)b, where a and b are random values (determined by hashing).  That is, we build the d choices from 2 random numbers instead of using d random numbers.  (The numbers are taken mod the size of the hash table, and b should be relatively prime to the hash table size... let's stop worrying about details.)  We find empirically that this makes no difference, in a very strong sense;  the fraction of bins with load j appears the same for every value of j for both systems, so you can't really tell them apart.  We provide a theoretical explanation, based on fluid limit models, for why this happens.

For Parallel Peeling Algorithms:
The analysis of several algorithms and data structures can be framed as a peeling process on a random hypergraph: vertices with degree less than k are removed until there are no vertices of degree less than k left. The remaining hypergraph is known as the k-core.  We analyze parallel peeling processes, where in each round, all vertices of degree less than k are removed. It is known that, below a specific edge density threshold, the k-core is empty
with high probability. We show that, with high probability, below this threshold, only O(log log n) rounds of peeling are needed to obtain the empty k-core for r-uniform hypergraphs. Interestingly, above this threshold, Ω(log n) rounds of peeling are required to find the non-empty k-core. Since most algorithms and data structures aim to peel to an empty k-core  this asymmetry appears fortunate;  nature is on our side. We verify the theoretical results both with simulation and with a parallel implementation using graphics processing units (GPUs). Our implementation provides insights into how to structure parallel peeling algorithms for efficiency in practice.

The Parallel Peeling Algorithms paper was, to my surprise, awarded Best Paper.  Maybe I'll write up more about the surprise at some point, but I'd certainly like to thank the committee for the honor, and pass the credit to where it is due, with my co-authors Jiayang Jiang and Justin Thaler. 


NSF Thanks for the Year

It's that time of year where, as a background process, I have to do my annual reviews for the NSF.  It's generally not a very exciting task, and their online forms remain, I think, unpleasantly designed.  (I think they fixed a problem I had last year, so at least now they point out to you what you haven't filled out more clearly.)

That being said, this year, even more than others, I find myself gratefully filling them out.  The NSF provides the money for my (and my students') research.  And research is fun.  In my few years playing administrator, I was trying to keep up with my research, but inevitably my available time and corresponding output started to decline.  This year, I've been able to "get back into it", and it made me realize how much I enjoy it.  Sure it's often frustrating.  And writing it down (and dealing with conferences, reviews, etc.) can also be frustrating.  But overall the creation process of research, including all the frustrating parts, is the most enjoyable part of the job*, and I'm glad that the NSF is there to support it.  

Thanks NSF.  I'll try to have those reports in well before the nominal deadline....

[* Yes, I like teaching and interacting with students too -- otherwise I'd look for work in one of the many great research labs -- that's the other most enjoyable part of the job.]  

Monday, June 16, 2014

Sad News: Berthold Vöcking

I have just seen the news that Berthold Vöcking passed away.  For those who didn't know him, Berthold was an oustanding researcher in algorithms.  We had several shared interests and were of a similar age;  I always enjoyed his work, and very much enjoyed the few times that I worked with him.  His paper How Asymmetry Helps Load Balancing still amazes me, and is one of the most wonderful papers that I wish I had written.  When I first saw it I simply didn't believe the main result*;  I stopped my other work, wrote up a simulation, and found it matched his analysis.  I then had to spend the next few days understanding it.  Once I understood it, I wondered how he had seen it, and how I had missed it.  It is a truly inspired paper.  

Of course he has other great papers, including the well known Tight Bounds for Worst-Case Equilibria, and maybe the less well known but very interesting (particularly to me) work on Balanced Allocations:  The Heavily Loaded Case.  He had just won a best paper award for ICALP for work on Online Independent Sets.  He was one of the editors of Algorithms Unplugged, a wonderful book project. 

Berthold was regularly on my list of people to invite to workshops, because he always had very interesting work to present and was interesting to talk to.  I can't believe I won't have another chance to talk with him.  His passing is a loss to our community.  

* For those who care, his result was that when hashing using the "power of two choices", suppose that instead of having each new item make two random choices and then placing the item in the least loaded, you split the table into two equal halves (call them left and right), have each item make a random choice from each half, and then place the item in the least loaded, except that you always break ties to the "left half".  There wouldn't seem to be any difference between the two schemes, but there is;  the split-and-break-ties approach works significantly better.    


Saturday, May 31, 2014

Child Geniuses and Other Articles

Stuck on a flight home, I passed time reading a hotel copy of the Wall Street Journal, to find more interesting things than I would have thought.

What first caught my eye is an interesting piece by Jordan Ellenberg, a math professor who also writes a lot about mathematics for more popular consumption, about The Wrong Way to Treat Child Geniuses.  Especially for any readers who did any of those science/math programs as a child (notably John Hopkins/SMPY) -- but not exclusively -- it's definitely worth a read.

Then there were also interesting articles about Ray Kurzweil and an amusing article about entitled Ticket to Dine: the Restaurant Reservation Revoultion.  The last article described a business model that is simultaneously disturbing and why-didn't-I-think-of-it -- using bots to snap up reservations on OpenTable to popular restaurants, and then selling/scalping the reservations.  Evil, or genius?  You decide....



Wednesday, May 28, 2014

ITCS 2015

I was asked to announce the call for ITCS.  Here is a link to the call for papers, with the most important info:
ITCS (previously known as ICS) seeks to promote research that carries a strong conceptual message (e.g., introducing a new concept or model, opening a new line of inquiry within traditional or cross-interdisciplinary areas, or introducing new techniques or new applications of known techniques). ITCS welcomes all submissions, whether aligned with current theory of computation research directions or deviating from them.
 
Important Dates
Paper Submission Deadline: Friday,
August 8, 2014, 5PM PDT
Apparently in some fit of weakness I agreed to be on the PC.  
 
I think an ongoing question about ITCS is how well it lives up to its "mission statement".  Part of the question is whether ITCS is necessary -- do FOCS/STOC/SODA not do a sufficiently good job of accepting "conceptual" papers -- and sufficient(ly doing a good job) -- is it really focusing on accepting conceptual papers, or is it just the same-old.  So I guess I'll be able to look under the hood to see how it's going.
 
My current/upcoming PCs have been SIGCOMM and coNEXT (2014), and ITCS and ICALP part C (2015).  I'm feeling happily diverse in what I get to read.  
 
  
 
 

Wednesday, May 21, 2014

What's Important in Algoirthms

I saw this interesting article up on 20 questions with Don Knuth, worth reading just for the fun of it.  But the following question on work in algorithm design and analysis particularly interested me, and I nodded especially hard resonating with the statement:

Thus I think the present state of research in algorithm design misunderstands the true nature of efficiency. The literature exhibits a dangerous trend in contemporary views of what deserves to be published.
 
But here's the whole question/answer for context.

15. Robert Tarjan, Princeton: What do you see as the most promising directions for future work in algorithm design and analysis? What interesting and important open problems do you see?
Don Knuth: My current draft about satisfiability already mentions 25 research problems, most of which are not yet well known to the theory community. Hence many of them might well be answered before Volume 4B is ready. Open problems pop up everywhere and often. But your question is, of course, really intended to be much more general.
In general I'm looking for more focus on algorithms that work fast with respect to problems whose size, n, is feasible. Most of today's literature is devoted to algorithms that are asymptotically great, but they are helpful only when n exceeds the size of the universe.
In one sense such literature makes my life easier, because I don't have to discuss those methods in TAOCP. I'm emphatically not against pure research, which significantly sharpens our abilities to deal with practical problems and which is interesting in its own right. So I sometimes play asymptotic games. But I sure wouldn't mind seeing a lot more algorithms that I could also use.
For instance, I've been reading about algorithms that decide whether or not a given graph G belongs to a certain class. Is G, say, chordal? You and others discovered some great algorithms for the chordality and minimum fillin problems, early on, and an enormous number of extremely ingenious procedures have subsequently been developed for characterizing the graphs of other classes. But I've been surprised to discover that very few of these newer algorithms have actually been implemented. They exist only on paper, and often with details only sketched.
Two years ago I needed an algorithm to decide whether G is a so-called comparability graph, and was disappointed by what had been published. I believe that all of the supposedly "most efficient" algorithms for that problem are too complicated to be trustworthy, even if I had a year to implement one of them.
Thus I think the present state of research in algorithm design misunderstands the true nature of efficiency. The literature exhibits a dangerous trend in contemporary views of what deserves to be published.
Another issue, when we come down to earth, is the efficiency of algorithms on real computers. As part of the Stanford GraphBase project I implemented four algorithms to compute minimum spanning trees of graphs, one of which was the very pretty method that you developed with Cheriton and Karp. Although I was expecting your method to be the winner, because it examines much of the data only half as often as the others, it actually came out two to three times worse than Kruskal's venerable method. Part of the reason was poor cache interaction, but the main cause was a large constant factor hidden by O notation.

Friday, May 02, 2014

Reviewing Question: What's Important

Nick Feamster on Google+ recently shared (and gave me permission to blog about) the following review comment:
This review comment is super enlightening:

"I think there is a general problem with the overall goal of this study and area of research. This seems to be making data usage a critical thing people should be paying attention to. People have  real issues and I am not at all sure that this one deserves attention. They have real concerns like, are they going to lose their job, where should their children go to college, should they encourage their elderly parents to move into a retirement center."
It's so cringeworthy, it's funny.  You could substitute "data usage" with pretty much any research topic you feel like, and you have a review that's almost certainly accurate and justifies rejection.  Which is what makes it such a wonderful, terrible review.  

Without wanting in any way to excuse this reviewer, I do want to say that this review highlights an ever-increasing problem:  I believe review decisions are more and more becoming dominated by subjective decisions about what topics are "important".  I realize some may say it has ever been thus, and I acknowledge that the importance of the underlying problem has always been a factor in judging a paper.  I think the subjective judgment has become more significant in both in systems and in theory over the years for multiple reasons.  As the field has expanded there's less of an underlying agreement and common understanding of what's important.  Many times the reviewer may not know the area well enough to judge the importance, and there is every-growing potential for area bias:  the problems in my area are (more) important.  Further, there are far too many papers for the available slots so reasons to reject have to be found.  As the above comment suggests, one can always call into question the importance of the problem the paper aims to solve.   

But finally, I think, it fits in with an issue that keeps coming up for me:  reviewers are too arrogant.  If they don't see why the problem is important, then the issue must be with the research or the writing;  it couldn't be with their reading or understanding of the problem space.  Reviewers will have opinions regarding the importance of the works they read -- no getting around that -- and they should where possible give advice to authors on how to best present their results.  But they could often be a bit more judicious in recognizing that they are expressing their opinion and offering advice;  they are not, in the end, the final arbiter of a paper's eventual, actual importance.

I don't see subjective decisions in "importance" going away.  But I think they could be given a bit more care, both in how they are used in the final decision-making, and in how reviewers express their opinions on "importance" to authors.

If you'll excuse me now, I don't have time to blog further, I have to go plan where my children should go to college and where I'll eventually ship off my aging parents.  (Thank goodness I have a tenured position.)