IT optimisation the Lucasfilm way

Lucasfilm is the creative force behind a host of special-effects-laden motion pictures, including the Star Wars, Indiana Jones and Pirates of the Caribbean series. The firm has six divisions in addition to the parent company: Industrial Light and Magic, the special effects group; Lucas Arts and Entertainment, the gaming division; Lucasfilm Animation; Skywalker Sound; Lucas Licensing; and Lucas Online. The company operates from three locations in the San Francisco area and the Lucasfilm Animation facility in Singapore. A staff of 57 IT professionals provides network and IT services for the company, which numbers about 1,200 employees. As you might expect, the demands on that IT group are significant, given the computing horsepower it takes to enable the likes of Johnny Depp to ward off sea creatures with creepy, octopus-like heads.

Kevin Clark (KC), director of IT operations for Lucasfilm, and Peter Hricak (PH), senior manager for network and telecommunications, explain how, even with a server farm of more than 4,000 machines and a WAN with 10Gbps links, ways still can be found to optimize.

Can you describe your network set-up?

Peter Hricak: For our campus networks we have three network cores, each based on a pair of 10G Ethernet chassis-based routers with a total of 128 10G ports. All desktops are usually linked at 1G to edge switches, which we connect to building distribution cores with two 10G interconnects. The building distribution cores then aggregate to the network core with four 10G interconnects each. Storage is directly 10G connected; we try to get as fast a path to the storage as we can.

On the WAN, we have two OC-3's connecting our campuses in the Bay Area and another to Singapore. We also have 10G dark fiber between two of our Bay Area campuses, as well as a 10G dark fiber line to a telco hotel in downtown San Francisco.

PH: The essence of the traffic is the work in progress that's being transferred and worked on by artists on a day-to-day basis. These are generally large image files and movies. We use frame-accurate Motion JPEG for our transmission, so they're not very compressed. They are rendered at night by a render farm for ILM, then reviewed the next day, and more changes are made and the cycle starts again.

What does the render farm consist of?

Kevin Clark: We've got approximately 4,300 processors available within the data center. We use a distributed rendering model, so we've got a core within our data center of varying generations of systems, but primarily dual-core, dual [AMD] Opteron blades with up to 16GB of memory on board. We also use available workstations that are out on the floor [such as after artists log off for the night]. Those are typically single-core or dual-core, dual-Opteron HP workstations. So the render farm in total comprises about 5,500 processors.

How does the rendering process work?

PH: We take models and textures and through mathematical equations -- sometimes through off-the-shelf software, sometimes through our own -- we render the final images. On the more difficult effects like water, what goes in is textures and some general physics equations, and what comes out is a two-minute sequence of a boat being [swamped] by a wave.

How do you go about optimizing an environment like that?

KC: It's kind of a brute-force approach in that you utilize all resources that are available. We're looking at making that process more efficient by utilizing multicore processors. Also, from a power-efficiency perspective, we've given some attention to the type of render nodes that we utilize. Our render nodes are diskless blade servers, 66 of them per cabinet. The cabinet plugs in straight to 480V AC power. We convert that to 480V DC, then distribute 48V DC to each node in the cabinet. So we're bypassing our PDUs [power distribution units]. There's less energy loss in stepping down directly from 480V to the nodes vs. if we step it down to the PDUs at 240V and then distribute out from there.

Another way we optimize is by staying current. That means utilizing multicore and whatever other resources are available.

In terms of optimizing our storage, we do deal with a lot of storage online. It's up to about 300TB online now, maybe just under. A lot of that is active data. Once a shot is completed and final, where a director says, "OK, that works for me," we'll archive that and remove it from the storage cluster. One of the problems we run into is that these shows grow in complexity and require more and more render and storage utilization. For example, when we did ["Star Wars"] Episode 3, back in 2005, that took up about 29TB of storage on our storage cluster. "Pirates" 2 went up to 60TB and "Pirates" 3 went up over 100TB. As that continues to grow, we could just continue to add disk, but it's not all that cost-effective. So we're really trying to work on how we can be more efficient in terms of workflow and our pipeline utilization so we can get that data offline quicker vs. just adding disk.
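The storage figures Clark quotes can be put side by side with a few lines of arithmetic. The sketch below is only an illustration of the trend he describes: it uses the numbers from the interview, treats "over 100" as a lower bound of 100TB, and the function name `growth_factors` is made up for this example.

```python
# Per-show storage figures quoted in the interview, in terabytes.
# "Pirates" 3 is given only as "over 100", so 100 is used as a lower bound.
show_storage_tb = {
    "Star Wars Episode 3": 29,
    "Pirates 2": 60,
    "Pirates 3": 100,
}

# "About 300TB online now, maybe just under" is rounded to 300 here.
CLUSTER_TB = 300


def growth_factors(sizes):
    """Return the ratio of each show's storage to the previous show's."""
    return [later / earlier for earlier, later in zip(sizes, sizes[1:])]


if __name__ == "__main__":
    for name, tb in show_storage_tb.items():
        share = tb / CLUSTER_TB
        print(f"{name}: {tb} TB, {share:.0%} of the online cluster")

    factors = growth_factors(list(show_storage_tb.values()))
    print("growth from show to show:", [round(f, 2) for f in factors])
```

On these figures a single show the size of "Pirates" 3 would occupy roughly a third of the online cluster, which is why getting finished shots archived sooner matters as much as adding disk.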

PH: Another optimization effort is the appropriate retirement of old equipment. We quickly realized that you end up spending more in service, support and power for what after three or four years becomes a pretty small computer that can be replaced with new hardware that's more efficient. We can replace four racks' worth of equipment with one rack of new gear. That clearly shows a savings on the power front year after year.

Do you have a figure for when your equipment is fully amortized?

KC: We typically work on a three-year cycle for depreciation. But we have a refresh cycle that is much faster than three years. We'll refresh systems for a specific artist or discipline every 12 months, sometimes less. We might be able to allocate those older workstations to a different discipline that's not going to need the same amount of memory or the same processing power, or we can reuse them for some type of administrative task.

What kinds of things have you done to optimize your wide-area links and get the most out of them, even taking into account that they're 10G Ethernet?

PH: We have a dark fiber line to a telco hotel. By having a presence in a major hub point for many carriers, we find that the last-mile costs [are measured by] the number of feet we are from a carrier that we need to connect to. We've managed a good deal of efficiency there. With the dark fiber, we run a 10G link with virtual LANs and MPLS, so we're able to bring in a variety of services on the same link. We can have telephony, private data to another studio, and public Internet services all running on the same high-capacity pipe. Without having to build a last mile for each of these carriers, it's much easier to bring in services rapidly and cost effectively.

What about on the campus -- any network optimization efforts there?

PH: We're a 100 percent VoIP shop at this point. One of the advantages that's brought us is that by bringing Power over Ethernet down to the port, we've managed basically to implement standards-based, 48V DC power to every desk, for everything from access points to telephone sets. I'm looking forward to when laptop manufacturers start feeding power off of Ethernet. So, we're in a pretty good place with very efficient power distribution, as a side benefit of having VoIP.

How are you dealing with power and cooling issues in your data centers?

KC: We're pretty aggressively pursuing virtualization options to reduce the number of physical servers. I'm sure everyone else suffers from the same thing where you've got maybe 10 different types of servers, whether they're FileMaker or some other type of application, but they're highly underutilized. We're working on consolidating those where we can. We're also pretty aggressively looking to retire some older render systems that we know aren't nearly as power-efficient. We're going to pull those out and replace about 17 racks of [AMD] Athlon-based render processors that are 4 or 5 years old with a single rack of the newest-generation dual-core, and soon to be quad-core, dual Opteron blade render systems. One of those will replace 17 racks: That says quite a bit. So we can save both on power efficiency, as well as the heating/cooling perspective.

What kinds of things do you do to optimize the performance of your various Web sites?

PH: What we've done isn't as much as we're planning on doing. We are upgrading the hardware, getting them on a platform with many fewer servers than they currently have. We're also looking at TCP-flow optimization through some network vendor products, as well as some caching. Flow optimization really helps the server count. What I was doing with 10 servers, I can now do with four. That's just through straight TCP optimization of the protocol, keeping connections open instead of closing them down all the time.
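To make "keeping connections open instead of closing them down all the time" concrete, here is a minimal sketch using only the Python standard library. It is not the vendor product Hricak mentions, and it is not Lucasfilm's setup: it starts a throwaway local HTTP server and counts how many TCP connections the server accepts when a client reuses one connection versus opening a new one per request. All names (`Handler`, `fetch_with_reuse`, `fetch_without_reuse`, `count_connections`) are hypothetical.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

connections_opened = 0  # incremented once per accepted TCP connection


class Handler(BaseHTTPRequestHandler):
    protocol_version = "HTTP/1.1"  # allows persistent (keep-alive) connections

    def setup(self):
        # setup() runs once per connection, not once per request.
        global connections_opened
        connections_opened += 1
        super().setup()

    def do_GET(self):
        body = b"ok"
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo output quiet


def fetch_with_reuse(host, port, paths):
    """Send every request over a single TCP connection."""
    conn = http.client.HTTPConnection(host, port)
    statuses = []
    for path in paths:
        conn.request("GET", path)
        resp = conn.getresponse()
        resp.read()  # drain the body so the connection can be reused
        statuses.append(resp.status)
    conn.close()
    return statuses


def fetch_without_reuse(host, port, paths):
    """Open and tear down a new TCP connection for each request."""
    statuses = []
    for path in paths:
        conn = http.client.HTTPConnection(host, port)
        conn.request("GET", path, headers={"Connection": "close"})
        resp = conn.getresponse()
        resp.read()
        statuses.append(resp.status)
        conn.close()
    return statuses


def count_connections(fetch, n=5):
    """Run `fetch` for n requests against a local server; return connections used."""
    global connections_opened
    server = ThreadingHTTPServer(("127.0.0.1", 0), Handler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    connections_opened = 0
    host, port = server.server_address
    fetch(host, port, ["/"] * n)
    server.shutdown()
    server.server_close()
    return connections_opened


if __name__ == "__main__":
    print("connections with reuse:   ", count_connections(fetch_with_reuse))
    print("connections without reuse:", count_connections(fetch_without_reuse))
```

Each avoided connection saves the server a TCP handshake and teardown, which is the kind of per-request overhead that, at scale, translates into fewer servers for the same load.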

What tips do you have for folks that need to optimize wide-area bandwidth?

PH: We opened our Singapore office a couple of years ago, and it was really the first site that was farther than a few kilometers from the data center. So we immediately fell into a lot of problems with the connection-oriented nature of TCP and the latency that's inherent in a 7,000-mile line. In terms of optimization there, the biggest bang for our buck comes from tweaking the application, because a well-written application doesn't care. So we have application engineers rewriting stuff, caching as much as they can, consolidating as much of their queries into one packet stream as much as possible. We've definitely seen more benefit there than anything else.
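A back-of-the-envelope model shows why consolidating queries pays off on a link this long. The sketch below assumes light travels through fiber at about 200,000 km/s (roughly two-thirds of its speed in a vacuum) and ignores routing, queuing and server time, so it gives a best-case lower bound only. The figure of 200 queries is an arbitrary illustration, not a number from the interview.

```python
# Rough model of why chatty applications suffer on a long WAN link.
# All constants are illustrative assumptions, not Lucasfilm measurements.

MILES_TO_KM = 1.609344
FIBER_SPEED_KM_PER_S = 200_000.0  # assumed propagation speed of light in fiber


def min_round_trip_ms(distance_miles: float) -> float:
    """Lower bound on round-trip time over fiber, ignoring routing and queuing."""
    one_way_s = distance_miles * MILES_TO_KM / FIBER_SPEED_KM_PER_S
    return 2 * one_way_s * 1000


def total_wait_ms(round_trips: int, rtt_ms: float) -> float:
    """Time spent waiting on the wire for sequential request/response pairs."""
    return round_trips * rtt_ms


if __name__ == "__main__":
    rtt = min_round_trip_ms(7000)  # the 7,000-mile line mentioned above
    print(f"best-case round trip: {rtt:.0f} ms")
    print(f"200 sequential queries: {total_wait_ms(200, rtt) / 1000:.1f} s")
    print(f"same queries in one batch: {total_wait_ms(1, rtt) / 1000:.2f} s")
```

Under these assumptions the best-case round trip is roughly 113 ms, so an application that issues its queries one at a time spends most of its time waiting on propagation delay, no matter how much bandwidth the link has.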

What's next? Do you have any other big optimization efforts on the horizon?

KC: From a systems perspective, it comes back to current technology. We're looking at adopting the quad-core technology when it comes out from AMD -- we're an AMD-based shop. We're also going to look at our storage challenges and figure out how best we can manage that.

PH: In the future on the networking side, we're also looking at going with higher-speed connections from smaller carriers. They tend to give us more bandwidth for less money and about the same reliability as traditional carriers, for T-3's and OC-3's. So we'll be trying to upgrade our WAN to wave-based technology. The bang for our buck is extraordinarily high on that.

Network World