Taming the Data Partie Deux: Open Law?

by Rodrigo

I just spent most of the last two days at the IGOTF non-conference. The meeting was interesting but not too productive. On the other hand, it seems that it may have resulted in enough organization to yield tangible fruit in the not-too-distant future. We’ll have to see what happens in the next few months. My most liberal estimate is that we are at the very least a year away from a free, comprehensive, open, uniform, standards-compliant, well-documented repository of law. But hey, if a bunch of geeks could collaborate to make things like the Linux kernel, I’m sure that a different bunch of geeks can make this happen. Of course, this one will involve a lot more talking to non-geeks than creating a free operating system, but Carl is the best person to lead the effort and given his previous accomplishments he just might be able to turn this into reality. Viva la open-source revolucion!

Why?

Rhymes, echoes and variations of one idea came up in several conversations: it is hard to believe that the US, with all its resources and infrastructure, does not have a publicly accessible repository of its laws at all levels of government. Of course, this idea is precisely what drives the existence of IGOTF and public.resource.org, but it is still one worth considering. When we were first considering starting Sonya Labs I was myself ideologically appalled by the fact that the only reliable way to access case law data is by paying for a subscription service. On the other hand, from several conversations I had, it seems that the “common wisdom” of DC is that the niche is so specific and so much within the confines of a big industry that the best way to have it implemented is privately.

I can see how one could draw this conclusion. It is mostly lawyers and other members of the legal industry who would make use of such a resource, and it is true that having it done privately could have benefits, especially if it is costly and difficult to run the repository. This was likely the case in the 80s and maybe even the 90s, but with the current state of technology it certainly is not so. The main difficulty is just policy: courts and legislatures have to agree on formats and systems, and if they did the problem would become fairly trivial and cheap.

Of course, all this says nothing of the ethical elephant in the room. The law should be publicly available to everyone. Some may even argue that the law is publicly available to everyone, just not in electronic form. Bollocks. The ubiquity of the internet in our lives demands that publicly available resources are also electronically available. May the seal beat the elephant!

Where?

My perception of the state of affairs is that the best way to convince the powers that be that this sort of goal is feasible and has an audience is to start doing it. To this end there are four main problems that need to be addressed:

1. data availability and uniformity
2. effective distribution of efforts
3. document identification
4. privacy issues

We’ll just have to live with 1. The best we can do is be ready to incorporate each new set of data into the repository as soon as it becomes available through some source (e.g. a court’s website starts publishing opinions). The better our solutions to 2, the easier this becomes.

In the interest of brevity, I will save my thoughts about the other 3 for my next post.

Who?

Us. We at Sonya are and will be happy to contribute programming hands to this effort.

You. Join the IGOTF mailing list if you’d like to help.

What?

Oh yeah… and I got us a giant seal of approval fridge magnet, which we had been needing. And you thought that the linux logo was the cutest possible. Ha. This one is cuter and a pun. Thanks, Carl!

Up and running around

by Rodrigo

Rare is the day that sees me up this early unless I’m still up from the night before. But today is special. For one, we have just moved and not unpacked. This will likely still be the case a couple of months from now. Now we thrive among hipsters and Italian eateries in Bloomfield. The new neighborhood is cool and all, but it wouldn’t get me out of bed at this ungodly hour on a Sunday.

The real reason for my being up is that I’m off to Chicago to attend the IGOTF conference sponsored by Public.Resource.Org. I will report on that in a couple of days. Until then.

Forget Silicon: How to Be Steel Valley — Can Web Startups be a ‘burgh Thing?

by Rodrigo

Of all places, I never would’ve expected to build my startup in Pittsburgh. I moved to the ‘burgh from Chicago when Sonya Labs got a seed-stage investment from AlphaLab. It is not so unfathomable that I’m here, though. It actually makes quite a bit of sense. Even Paul Graham, a Pittsburgh native, who is famous for advising that startups go to Boston or Silicon Valley, says so!

Pittsburgh has the opposite problem: plenty of nerds, but no rich people. The top US Computer Science departments are said to be MIT, Stanford, Berkeley, and Carnegie-Mellon. MIT yielded Route 128. Stanford and Berkeley yielded Silicon Valley. But Carnegie-Mellon? The record skips at that point. Lower down the list, the University of Washington yielded a high-tech community in Seattle, and the University of Texas at Austin yielded one in Austin. But what happened in Pittsburgh? And in Ithaca, home of Cornell, which is also high on the list?

Rich people don’t want to live in Pittsburgh or Ithaca. So while there are plenty of hackers who could start startups, there’s no one to invest in them.

Well, maybe Paul doesn’t quite say so, but he does try to explain why there isn’t a startup hub here. It simply makes sense for one of the top CS programs in the country to turn its home into one. His explanation is a bit flawed, though: no rich people in one of the steel capitals of the past century? Maybe there is no young money, but there has got to be money somewhere. Not only that, but Pittsburgh is close to all of that east coast money.

On the other hand, if all the hackers that study here are happy to leave and there is nothing to attract hackers to come, money doesn’t count for much. Pittsburgh seems to have more of that problem than a lack of money.

The puzzle goes a bit deeper. For a web startup made of cheap servers, ramen and young people, the main expenses to get started are rent and food. Pittsburgh then, being a lot cheaper than Boston or the Valley, makes even more sense. Here our AlphaLab money will last us at least six months and possibly even eight, whereas it would only last three or four in the other places.

Of course, this puzzle is a lot more complex than solvable with a single answer, but if I was a CMU hacker working on something startup worthy, there are just about enough incentives to stay in Pittsburgh; and if I was a mere mortal hacker there even seem to be enough incentives to come here.

Mike Madison hits closer to what I think is one of the main problems:

This puts the lie, I think, to the most famous and durable of Pittsburgh stereotypes, that this town and region are noteworthy for their honesty and work ethic. If you have a job, that stereotype certainly seems to fit — but a big part of that job seems to be keeping it intact, and keeping others at arms’ length. Is there a “Not Welcome” sign posted in the region’s employment markets? It sure seems that way.

I think that more than a “Not Welcome” sign, Pittsburgh has no sign at all. At least when it comes to startup founders. Since landing here, I’ve noticed the vibe of reinvention just about everywhere I’ve set foot. The city is and has been trying to renew itself into being a startup hub among other things.

Whereas I could always feel the academic attitude of Boston, the entrepreneurial one of the bay area and the alterno-hipster culture of Portland from afar, it was only after a few weeks here that I felt anything about Pittsburgh. I’ve been thinking about startups for the last three years and never once had Pittsburgh crossed my mind as a potential place, but places like Austin, Seattle, Chicago (before I moved there) and even Portland had. There isn’t enough noise!

How to go about hanging the right sign at the door is also tricky business. The first part would be a bit of a media campaign in the adequate parts of the blogosphere. I mean nerds and hackers blogging to nerds and hackers about nerdiness, hackery, and Pittsburgh. A bit of a catch-22 if they aren’t coming or staying here in the first place, though. Maybe we can help. We’re nerdy looking and we mumble “linux”, “open source” and “python” enough to pass for hackers.

Programs like AlphaLab are also a good start. If the program is successful in a couple of years there will be a solid network of alumni founders in the city which will make their noise and attract more people in turn. Even more important than helping founders start is helping people who are already here continue, though. Ahem.

Self interest aside, another important piece to attracting founders is getting VCs to invest smaller amounts in more companies. Convincing VCs to invest angel-like amounts and angels to behave like VCs will be part of the next wave of web startups wherever it happens. The trend in the costs of startups is clear by now, so I won’t go into why most startups that are a few months old don’t need several million, but only several hundred thousand to move forward. If you want to give us a few million at the right valuation we won’t object, though.

Finally, whatever “Not Welcome” sign there is needs to be done away with. Duh. There is no room for anti-immigration attitudes. Letting Mike make my point:

Moreover, there are sizable communities in the Pittsburgh region that see potential increases in immigration rates as undesirable — either because immigration of lower-skilled workers threatens existing blue-collar employment and depresses wages, or because in-bound higher-skilled workers compete for positions with people who already live here, or both. Somewhere in Pittsburgh, someone is asking why Sycor wants to raise the H1-B visa cap rather than hire skilled people who already live in Pittsburgh.

The problem, in other words, is that immigration is perceived by many as a threat to the pie that we already have, rather than as part of a process of growing the pie.

That one is much more difficult so I’ll let people like Mike work on that problem.

20/6

by Rodrigo

My laptop was stolen from the office yesterday while I was at lunch. What fun would a startup be without any trials and tribulations?

We are lucky to be an AlphaLab startup. I was still in a state of denial when Jonathan from gamehuddle offered to let me borrow a laptop so work could go uninterrupted, and today at our weekly meeting the AlphaLab team told me that they will gladly help Sonya Labs cope with the expense of a new laptop. And I thought that I’d have to roll with the punches!

Anecdotes of stolen laptops aside, though, AlphaLab is a truly brilliant program. There is the whole Y Combinator style “incubator” which has been plenty celebrated and applauded in internet circles, and all the benefits that come with that. The style is so effective that similar programs have popped up in several cities, and AlphaLab offers plenty of that. But there is one thing that AlphaLab seems to have as a very clear advantage over some of the other similar programs. The other programs seem to be about 20 startups per class working with a team of 6 advisors. Since AlphaLab is a program of Innovation Works, it is actually 6 startups with a team of over 20 experts in all the relevant fields as advisors. Need PR advice? Talk to Terri. Have a quick legal question? Talk to Deborah. It goes without saying that most of the advisors are or have been entrepreneurs themselves.

And there is, of course, the office provided:

[Photo: IMG_0471]

Now if only Silas would man his laptop and get some work done instead of taking pictures…

Pesky New Things

by Rodrigo

Via ReadWriteWeb I found the following gem of a quote:

Eleanor Coner, the SPTC’s information officer, said: “Children are very IT-savvy, but they are rubbish at researching. The sad fact is most children these days use libraries for computers, not the books. We accept that as a sign of the times, but schools must teach pupils not to believe everything they read.

“It’s dangerous when the internet is littered with opinion and inaccurate information which could be taken as fact.

This sounds a little bit like someone from the early 1900s being upset that people are learning how to drive and forgetting how to ride horses and citing the fact that roads can be dangerous as evidence.

Sarah makes several good points about this in the RWW article and from one of the links:

One comment on The Scotsman makes a fair, if tired, point:
Easier to blame Wikipedia than the fact that you’re poor parents and your children are out partying or playing video games.
Inaccuracies are found in standard encyclopedias (and newspapers) too. And besides, don’t your schools provide textbooks?

Of course there is good and bad information out there, but that is just as true of printed materials. Critical thinking is a skill, not a property of the communication medium of choice. It should go without saying that as a communication medium the internet is much more powerful, flexible and overall superior to print. Granted, there are still things in print that one can’t find on the intertubes, but that is bound to change in the not-too-distant future when Google finishes scanning every book in print out there.

Now, if we could only convince the powers that be that the same applies to case law …

Taming the Data Partie Une

by Rodrigo

It is French for part one. There will be many parts to the series, I’m afraid.

The last couple of weeks have seen us trying to write a parser to get some case data into our own database in our format. The resource.org data looked very clean and structured so “piece of cake” thought Rigo and Nathan. Before we started we had some learning to do: we needed to pick an sql python orm and a library to parse html for us.

I had heard good things about Beautiful Soup and tried to use it for the same task back in March when we were writing our proof-of-concept prototype. The first time the parser ran it was apparent that it wasn’t fast enough, and after some simple timing of the different parts it became obvious that the soup was too slow. So, this time we started by looking at the different Python XML libraries and doing some crude timing tests. Following the suggestion of Ian Bicking, and because we liked the fact that it had a special-purpose HTML parser, we picked lxml for our job. Fast forward a couple of weeks and it seems that it is fast enough.

As for the orm, we looked a little bit into the usual suspects, but we decided to just use the django orm with a hack that Nathan cooked up to be able to use it with or without the rest of the framework.

Two days went there.

Then we started looking at the data and it seemed straightforward enough: html, all the parts we care about clearly marked with distinctive names, and we were off to the races. As soon as we had something that appeared to be working, the first task was to make it paranoid: either it gets the data it expects or it dies. Fast forward ten days and our parser was still dying on 90% of the data. It was a big and rich piece of cake, I guess. It turns out that the most regular data set we have is quite irregular and has many exceptions. We still don’t quite understand how this happens, since it seems that the aim of resource.org is to store the data in a very uniform way. Somewhere in our TODO is looking at their parsers and seeing if we can identify the problem(s) and give a hand in fixing them.

A full work day after the 90% failure rate we were up to 98% success. Of course, this just means that we made the parser and our format less and less restrictive until we could parse enough data. There is a very clear trade-off there that is fairly general: how much structure you keep vs how much time you spend trying to handle all the nuisances that come with trying to parse that structure.
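To make the trade-off a bit more concrete, here is a minimal sketch of what I mean by a paranoid parser and by loosening it. It is not our actual code: it uses Python's standard library `html.parser` as a stand-in for lxml so that it runs anywhere, and the field names (`parties`, `docket`, `date`, `court`) and the sample documents are made up for illustration.

```python
from html.parser import HTMLParser

# Hypothetical field names; the real markup may well use different ones.
REQUIRED = ("parties", "docket", "date")
OPTIONAL = ("court",)
VOID_TAGS = {"br", "hr", "img", "meta", "link", "input"}


class FieldCollector(HTMLParser):
    """Collect the text of every element whose class is a known field name."""

    def __init__(self, wanted):
        super().__init__()
        self.wanted = set(wanted)
        self.fields = {}
        self._stack = []  # one entry per open tag: a field name or None

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return  # void tags never get a closing tag
        cls = dict(attrs).get("class")
        self._stack.append(cls if cls in self.wanted else None)

    def handle_startendtag(self, tag, attrs):
        pass  # self-closing tags carry no text

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self._stack:
            self._stack.pop()

    def handle_data(self, data):
        # Attribute the text to the innermost open field, if any.
        for name in reversed(self._stack):
            if name is not None:
                self.fields[name] = self.fields.get(name, "") + data
                break


def parse_case(html, strict=True):
    """Return the fields found; in strict mode, die if a required one is missing."""
    collector = FieldCollector(REQUIRED + OPTIONAL)
    collector.feed(html)
    collector.close()
    fields = {name: text.strip() for name, text in collector.fields.items()}
    missing = [name for name in REQUIRED if not fields.get(name)]
    if missing and strict:
        raise ValueError("missing fields: " + ", ".join(missing))
    return fields


def success_rate(documents, strict=True):
    """Fraction of documents that parse without dying."""
    parsed = 0
    for doc in documents:
        try:
            parse_case(doc, strict=strict)
            parsed += 1
        except ValueError:
            pass
    return parsed / len(documents)


# Made-up sample documents: one complete, one missing most fields.
GOOD = ('<div><p class="parties">Doe v. Roe</p>'
        '<p class="docket">No. 00-0000</p><p class="date">1 January 2000</p></div>')
BAD = '<div><p class="parties">Doe v. Roe</p></div>'
```

With these two samples, `success_rate([GOOD, BAD])` is 0.5 in strict mode and 1.0 with `strict=False`. The second number looks better only because we agreed to keep less structure, which is exactly the trade-off.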

There are still things to fix, but we decided we could put this a bit closer to the back burner and we moved on to other tasks with the 98% of Supreme Court cases in our database. The F.2d and F.3d volumes of the Federal Reporter come from the same source so soon enough they should also be in our database.

Sleepless in Seattle

by Rodrigo

If legal research is, like Walter said, still wearing flannel and humming smells like teen spirit, then currently I am sleepless in Seattle. No joke. It is 8:48am and I am still at the office since yesterday and that is not the record either. This I do in direct contradiction to what people much smarter than me recommend. Heck, I even heard DHH speak at startup school earlier this year and largely agreed with just about everything he had to say. And don’t get me wrong, I really do value having a balanced lifestyle, but …

I have always been a bad student. Well, not quite. My grades were alright enough to get me both into college and graduate school, but I was notorious in college for never going to class. The first time I applied the all-out technique was freshman year when I taught myself the third quarter of calculus in three afternoon/nights at the library after not having attended all quarter. I did not forget any of it, either. The laser-sharp focus I had towards learning the class at the time worked really well for me.

The all-out approach has served me well and plenty since then. Almost every subject I learned well during my college years I learned through bursts of laser-focused effort. Ditto for projects. I did not care about sleep too much back then. My body is now older, though. The last time (before Sonya) that I applied the all-out paradigm was when I took my qualifier exams last summer. For about six weeks I did three things: eat, sleep, and study physics for the quals. Yes, you read right, sleep was in there and it worked. The trick was that I lived as my body dictated while I was doing that. I got rid of my alarm clock. When I got tired I came back from the coffee shop and slept until I woke up. When I got hungry I got food. When my back felt like I had been sitting for too long I took a walk. When I just had to play guitar I did. When the time to take the quals came I passed. Sleepless 2.0.

And here am I at it again. If not all-out why do it at all? I’ll be laser-focused and Sleepless (2.0) in Seattle until legal research is living in Oregon, downloading Jack Johnson songs off of itunes and wearing crocs. Ironic that lasers are used to play CDs, isn’t it? That is so grunge, dude.

The Structure of Law

by Rodrigo

In this day of xml, social networks, user generated content and the rest of web2.0 buzzwordery it is easy to forget that legacy data exists and that it usually is not in any kind of standardized format. Furthermore, there is no chance to ask for something better. Developers get to say “thank you” and have to make sense of the mess. Thankfully, several groups, most notably resource.org, have already taken a huge leap for us. Nevertheless, the data is still far from perfect and far from complete. Sonya has a lot of work to do on that front.

Case law has structure. A lot of it.

First, opinions are organized in certain ways. Cases are cited in particular formats and, on top of that, there are docket numbers, parties and dates to help identify them. Then there are the actual opinions that make up a case, which can be broken up into paragraphs. It is unfortunate that people are used to citing cases using page numbers instead of paragraph numbers, but the AALL is already working on that.
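As a rough illustration of how regular the citation formats are, here is a small sketch that pulls volume/reporter/page citations out of a piece of text. It is only an assumption of how one might start: the pattern knows just three reporter abbreviations (U.S., F.2d and F.3d), real citations come in many more forms, and the numbers in the sample sentence are made up.

```python
import re
from typing import List, NamedTuple


class Citation(NamedTuple):
    volume: int
    reporter: str
    page: int


# Illustrative pattern only: volume, one of three reporter abbreviations, page.
CITATION_RE = re.compile(r"\b(\d+)\s+(U\.S\.|F\.2d|F\.3d)\s+(\d+)\b")


def find_citations(text: str) -> List[Citation]:
    """Return every volume/reporter/page citation found in the text."""
    return [
        Citation(int(volume), reporter, int(page))
        for volume, reporter, page in CITATION_RE.findall(text)
    ]


# Made-up sentence with made-up citation numbers.
SAMPLE = "See 123 U.S. 456 and 789 F.2d 1011."
```

Calling `find_citations(SAMPLE)` gives back two `Citation` tuples, one per reporter. Parties, docket numbers and dates would need their own, messier, handling.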

Second, citations turn the law into a network. When a case refers to another case it makes the latter a bit more important — even more so if the former is already an important case. Much in the same way that when a web-page links to another it makes the latter more relevant. (Google made billions off this idea!) The law is even nicer because the citation network is topologically sorted. That’s just geekspeak for saying that the network that is generated when one connects related cases has arrows pointing in one direction: backwards in time. If we figure out a way to cite cases that haven’t happened yet, we may switch our primary focus. :-)
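For the geeks, here is a toy version of both claims: that every arrow points backwards in time, and that being cited by an important case counts for more. The scoring is a plain PageRank-style power iteration, which is my stand-in for illustration and not a description of what Sonya actually runs. The case names and years are invented, and comparing years is a simplification of comparing decision dates.

```python
# Invented toy data: which case cites which, and when each was decided.
CITES = {"C": ["A", "B"], "B": ["A"], "A": []}
YEAR = {"A": 1950, "B": 1970, "C": 1990}


def points_backwards(cites, year):
    """True if every citation goes from a later case to an earlier one."""
    return all(
        year[src] > year[dst]
        for src, targets in cites.items()
        for dst in targets
    )


def rank(cites, damping=0.85, iterations=50):
    """PageRank-style importance scores over the citation network."""
    nodes = set(cites) | {dst for targets in cites.values() for dst in targets}
    n = len(nodes)
    score = {case: 1.0 / n for case in nodes}
    for _ in range(iterations):
        new = {case: (1.0 - damping) / n for case in nodes}
        for src in nodes:
            targets = cites.get(src, [])
            if targets:
                # A citing case passes its weight on to the cases it cites.
                share = damping * score[src] / len(targets)
                for dst in targets:
                    new[dst] += share
            else:
                # A case that cites nothing spreads its weight evenly.
                share = damping * score[src] / n
                for case in nodes:
                    new[case] += share
        score = new
    return score
```

With the toy data, `points_backwards(CITES, YEAR)` holds and `rank(CITES)` puts A above B above C: the oldest case is cited by everyone, including by a case that is itself cited.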

Third, case law can be organized by legal issue. Certain cases may have more citations and thus be overall “more important”, but they may not have anything to do with the issue at hand. For a human extracting the relevant issues from a case takes no more than reading the case with some care. Unfortunately, computers are not yet that smart and this is a difficult problem. Rest assured, one that Sonya is trying to solve.

Fourth, case law is organized by lawyers researching it. As a lawyer gathers cases to cite as precedents while preparing a case he does not go browsing the law at random. There is structure in his thought process which relates to the structure the web of law has itself.

Once the legacy data is tamed, all this structure can be captured to create an interface which helps lawyers do research more effectively. That’s Sonya’s ultimate goal.

Who is Sonya?

by Rodrigo

Sonya is wisdom. Sonya is Uncle Vanya’s niece.

4:03 AM Nathan: i'd be interested in taking over the world

That’s from my gchat log from November 22, 2007.

Sonya is executing our ideas.

For a couple of years now I have been meaning to start a company. By February, I had convinced Nathan that doing so is a good idea and he acted as the hub for us to partner with Walter and Silas. We started without even a clear idea of what our company would do. After a few hundred emails on our mailing list, Nathan and I put our PhD programs in what very well may be an indefinite pause, Silas turned down jobs, and here we go. Now we find ourselves in a new venture with a clear goal, more tasks than I can keep count of, staring at code for twelve hours a day, incorporated and moving.

Sonya is Inc.

We’re based in Pittsburgh. We live in Oakland (the neighborhood) and the office is in the trendy south side. If one is up at such ungodly hours, one can usually find us there coding our alpha version and kicking a soccer ball at 3am.

Sonya is our latest endeavor. Sonya is us.
