Softpanorama

May the source be with you, but remember the KISS principle ;-)
(slightly skeptical) Educational society promoting "Back to basics" movement against IT overcomplexity and  bastardization of classic Unix

History of Grid Engine Development


Introduction

Grid Engine has a complex history. It was bought by Sun and for several years was developed by Sun, which renamed it Sun Grid Engine. Sun marketed it successfully, and because of this (as well as its good design and capabilities) the product became quite popular. Sun released the source code in 2001, and the last open source version appeared just before the acquisition by Oracle, providing all Unix platforms, including Linux, with an enterprise-class batch system.

The initial period

As Fritz Ferstl describes in his History (GridEngine.com, 03/27/2012):

In 1990, Wolfgang Gentzsch formed the company Genias Software in Neutraubling, Germany (near Regensburg) and initially focused on vectorizing and parallelizing technical and scientific application software code. Genias learned quickly that often code could be run "as is" distributed in a cluster to obtain the same result. A resource management tool was required. In 1991 Fritz Ferstl joined Genias as a lead developer and in 1992 began to work on resource management software called Codine, which was based on DQS from Florida State University. Fritz was instrumental in evolving Codine over the years and deriving GRD (Global Resource Director) in 1996. GRD was the result of collaboration between Genias, Raytheon and Instrumental Inc.

With GRD in hand, Genias’ business began to grow rapidly and in 1999 they merged with California-based Chord Systems and renamed the company GridWare. With the business scaling quickly they soon caught the attention of many and were acquired in August of 2000 by Sun Microsystems. That same year Sun renamed the product Grid Engine and released a free version for Solaris, followed by a version for Linux in 2001. Also in 2001, Sun released the source code thereby cementing the beginning of a rapid adoption phase for Grid Engine.

Being part of Sun created a number of opportunities and changes for the team. Fritz Ferstl was put in charge of the development of the business, while the company bundled the software with its leading server systems that shipped into Solaris strongholds - namely Financial Services, Semiconductor and Oil & Gas customers. As additional fuel, Sun provided free qualified courtesy binaries and offered superior community support to organizations that adopted Grid Engine. Many organizations began to use Grid Engine with the comfort that a multi-billion dollar company was supporting them.

The number of deployments of Grid Engine grew rapidly and surpassed 10,000 data centers when Oracle acquired Sun in January 2010.

Sun Period

Here is a short history of the product from Wikipedia:

In 2000, Sun acquired Gridware, Inc. a privately owned commercial vendor of advanced computing resource management software with offices in San Jose, Calif., and Regensburg, Germany.[23] Later that year, Sun offered a free version of Gridware for Solaris and Linux, and renamed the product Sun Grid Engine.

In 2001, Sun made the source code available, [24] and adopted the open source development model. Ports for Mac OS X and *BSD were contributed by the non-Sun open source developers.[25]

In 2010, after the purchase of Sun by Oracle, the Grid Engine 6.2 update 6 source code was not included with the binaries, and changes were not put back to the project's source repository. In response to this, the Grid Engine community started the Open Grid Scheduler project to continue to develop and maintain a free implementation of Grid Engine. [26] [27] [28]

Here is a quote from GridWiki:

(The news below refer to the old Sun sponsored project and are no longer of much relevance. Some of the links may be dead.)

The last Sun release of open source version of Grid Engine was version 6.2 update 5 (ogf.org).

Date        Release       Highlights
23.09.2008  6.2 major     SDM, scalability (> 60000 cores), AR, IJS
18.12.2008  6.2 update 1  maintenance release
31.03.2009  6.2 update 2  GUI Installer, JSV, Per Job Resources, jemalloc
23.06.2009  6.2 update 3  SGE Inspect, SDM Cloud Adapter, Exclusive Host
23.10.2009  6.2 update 4  maintenance release
22.12.2009  6.2 update 5  Slotwise Preemption, Core Binding, enhanced Inspect, Java JSV, Array Job Throttling, Hadoop Support

Development continued after that until version 6.2u7, but the source code for the last two releases was not made available.

The version 6.2u5 -- SGE classic

SGE version 6.2u5, released by Sun at the end of 2009, plays a special role in the history of Sun Grid Engine development. It became the classic version of SGE. It is the last open source version produced by Sun and was the most popular version deployed as of 2014. Mainly due to external circumstances (the Oracle acquisition of Sun) it became the de facto standard implementation of SGE, the implementation against which all others are judged.

The main difference between it and the other versions of SGE floating around is not functionality, which is adequate even in version 6.2u5, but reliability: version 6.2u5 has a set of bugs which now, after four years of usage, are pretty well known but not easy to fix.

SGE 6.2u5 also has some shortcomings because hardware has improved significantly since 2010 (20-core blades are a commodity, and a single blade enclosure can hold a 320-core cluster) and Linux has improved somewhat too, but this is a lesser concern.

The main addition that various groups attempted was support for Linux cgroups, which is now implemented in Univa Grid Engine and Son of Grid Engine. Later, some development effort went into adding support for Linux containers, but Linux containers themselves (an attempt to replicate Solaris zones in Linux) are still pretty new. Theoretically they make it possible, for example, to move a container from one server to another to free computational resources for a large and urgent job, which is a real problem for large cluster installations: a kind of "container-based preemption".

Oracle Acquisition

After Oracle acquired Sun it automatically got Sun Grid Engine as a part of the Sun software portfolio. It quickly renamed it Oracle Grid Engine.

Oracle was not really interested in HPC and neglected Grid Engine while it owned it, despite the fact that it could have integrated it into its own version of Linux, making that the primary platform for HPC. As Fritz Ferstl recollected (History of Sun Grid, 03/27/2012):

By the end of 2010, Oracle had closed the open source community, stopped shipping source code, increased the license fees and essentially eliminated the HPC business that Sun was famous for. Oracle's decisions created a vacuum in the market whereby organizations using Grid Engine were faced with decisions as a result.

Oracle tried to market it as part of Oracle Enterprise Manager 11g, which is an interesting product but in architectural quality and significance is far below the level of SGE. It is a regular "run-of-the-mill" monitoring system specialized for the Oracle database. Grid Engine was made part of Ops Center, a business unit whose culture was not in sync with Grid Engine. The first Oracle release was 6.2u6. The last was 6.2u7 (June 2011).

After that development stalled, and version 6.2 Update 7 remained current until the very end of Oracle's ownership of the product. The binaries provided by Oracle were really old even in 2012. The Oracle version has some tools (for example qmon) that work only on RHEL 4 (qmon requires openmotif-2.2.3 libraries that were last provided for RHEL 4).

The problem was that Oracle never had real interest in the product, which for them was a niche product without much profit potential. So it was only natural that in late 2013 it abandoned it. Oracle sold its customer base, source code and trademarks to Univa, which provides support for existing Oracle customers. For Univa the source code was not of much value, as their fork of 6.2u5 had by this point deviated significantly from the path of the Sun developers who produced versions 6.2u6 and 6.2u7. At the same time they did not intend to release it as open source, as they (legitimately) were afraid of competition from such an open source product. So they just killed it.

Oracle, which vandalized Sun web sites after the acquisition, behaved really terribly here and removed everything connected with Oracle Grid Engine from its site.

Most customers have a perpetual license and can go forward without support, as the codebase is very stable (although the binaries are very old, compiled for RHEL 4.x).

Oracle was never interested in the HPC area and eventually abandoned the product in 2013. What is bad is that those vandals destroyed the Sun website and discarded many valuable materials.

Univa acquired Oracle codebase but discarded their version in favor of its own (fork of SGE 6.2u5)

On January 18, 2011, it was announced that Univa had recruited several principal engineers from the former Sun Grid Engine team and that Univa would be developing its own forked version of Grid Engine. The newly announced Univa Grid Engine would include commercial support and would compete with the official version of Oracle Grid Engine. Initially they planned to produce an open source "base" version and a closed source "professional" version, following the path of Sun, but they quickly changed their mind, which was probably connected with the change of CEO (The truth about Univa Grid Engine - Google+):

Look What they have Done to Grid Engine!

If you are looking for the truth about what is going on in the Grid Engine world then you have come to the right place. Ever since the company searching for a product called Univa came into the picture last January the entire community has been in disarray. Hopefully this website and other sites in support of the open source community will help clear this up for you and anyone else wondering what’s going on with Grid Engine Currently there are four main forks of Grid Engine.

How free is free? Totally open source Grid Engine still requires some resources. So will anything. But given that the free versions are free to obtain and free to support from multiple places with very dedicated maintainers with a deep knowledge of GE, it’s as close to free as you can get in this world.

Univa seems to want to scare you into thinking that using open source software will cost too much. That’s a similar argument that Microsoft used to make about Linux. It can certainly be true but with even a small degree of competence it can be pretty easy to see that G.E. can be had and supported for a very low cost from someone other than Univa. Who offers the best commercial support for Grid Engine? Oracle has one team that develops Grid Engine. They have another team that support it.

Univa seems to want to keep secret how many people they have but we have uncovered that they have a team that does some development but they also do some support as well. As for the support team, we have found that they have one person. That’s one person to support all of the Univa Grid Engine customers. Wow!! And who does Q.A. testing? Good Question but I best Oracle has a team for that as well. With so few people how can Univa support, test, and develop a quality product?

This can mean one of two things. Either Univa has no customers to support OR they are not able to support the customers they have. What about bug fixes? Univa has pasted all over their website that they have bug fixes. Why do they focus on bug fixes so much and why are they creating software with so many bugs that need fixing? This never seemed to be that big of an issue before Univa Grid Engine came along. Could it be because of the issues outlined above?

OGS has bug fixes too and has far fewer problems than Univa Grid Engine has. What about Univa FUD? There has been a bit of an uproar in the community about some of the tactics that Univa uses in their pursuit of customers. One such tactic is that of software licensing. They seem to claim that many users of free G.E. are using it in violation of software licensing and that they could be sued by a company such as Oracle over this. They also seem to claim that by purchasing Univa Grid Engine they will be protected. We looked into this a bit and found not one case where Oracle has ‘gone after’ anyone.

We also have to ask if there is an agreement that allows Univa to sell Univa Grid Engine commercially from what is essentially an Oracle product. This also makes us wonder if Univa could get clamped down on by Oracle. What would happen to a Univa Grid Engine support contract if Oracle shut down Univa? We have to conclude that the Univa FUD is just that, FEAR. They seem to want you to be afraid so that you will buy from them. Why does it seem that Univa is attacking an open source project? This takes a bit of research. It seems that Univa is a company that has merged with another similar company called United Devices. Each has a history that runs to the early 2000’s. Each has gone through many rounds of funding which usually indicates that the company is unable to become profitable. Unprofitable companies tend to not survive and the best people in these companies tend to vacate which tends to leave only the ones that aren’t really smart enough to see what is going on, or care. In hopes of saving their jobs, the management teams in these companies will often do anything they can to try to make money and convince investors to keep giving them more money. Further research shows that in late 2010 Univa changed CEO’s to a guy named Gary Tyreman.

This is a guy who apparently left Platform Computing on some very bad terms and seems to have had it out for them ever since. Platform Computing has LSF, which is a competitive product to G.E. So it looks like Tyreman decided to take G.E. and tried to chase Platform Computing customers with it but that probably didn’t work so well so the next target was the poor free G.E. users. In order to get these people (10,000 users according to their website) they had to hinder the free open source community and make these users become afraid. It’s obvious that the biggest competitor to Univa Grid Engine is not Oracle but rather the free open source community.

Univa seems to veiw this community with distain and it seems that they would really like to eliminate the entire community and have everyone pay to use Univa Grid Engine. So, to put all of this together you have a CEO who has to keep investors ignorant to the fact that he can’t make them any money. You have a user group that has no real voice and is thus an easy target for an unscrupulous person such as Tyreman. Thus Univa creates Univa Grid Engine and discredits the open source community and then claims that these people are potential customers to the Univa investors in hopes that these investors pump more money into the company to make guys like Tyreman rich.

Now surely Univa will deny this and do everything in their power to have this truth stuck down but the truth is still out there. Look around, ask around, and search on Google. You will see the history right there in articles and even on the Univa Website. Look at the community Grid Engine website and look at the posts from the community. These people are upset with Univa and feel that they have been bullied by them. Ask questions. Why is Univa doing things that seem sound on the outside but if you dig a bit deeper it looks shady. Why does Univa use testimonials from a guy like Scott Clark? Clark is a supporter of Univa Grid Engine on the website but if you research him on Google you find that he is from Broadcom. Further research on Broadcom shows that they use LSF from Platform so what's the truth Mr. Tyreman? Why does Univa use the same testimonials on your website twice?

Why does the Univa website use testimonials from people that work at Univa? Why does Univa try to claim so many fixes yet many of them are the same things that OGS engineers have added as well. Why do members of the Univa management team put down enhancements made to Grid Engine by dedicated engineers working on it for free on the user forums? The more research that is done on this company and Univa Grid Engine the more questions come up. Where to from here for Grid Engine? Right now there is little doubt that open source G.E. is under pressure. There seems to be confusion about what exactly Oracle will do with their version and the free versions are struggling to find their voice. Univa Grid Engine had great promise but by closing the code, scaring customers, and attacking the open source community Univa has done great harm to the project. It does seem now that the community is producing some very dedicated players that are forging small companies that offer commercial support while fully cultivating the open source project.

Hopefully that will continue. GridEnginetruth.org will continue to monitor this situation and promote the open source community and provide a fair and truthful assessment of Univa Grid Engine.

Go to www.gridenginetruth.org for more info

On Oct 22, 2013 Univa announced that it had acquired the intellectual property and trademarks pertaining to the Grid Engine technology and that Univa would take over support for Oracle Grid Engine customers. This move made Univa a monopolist in the commercial Grid Engine market.

They mainly acquired the Oracle customer base. But as they license their product per core (the price was $99 per core per year, according to The Register), not per server, they are having trouble preserving this user base: Oracle licensed its product per socket (per CPU), and this change meant a significant hike in maintenance fees. Their version does not provide sizable advantages over Oracle Grid Engine, and as Oracle sold the product with a perpetual license, existing Oracle customers can use it without support "forever".
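To make the per-core arithmetic concrete, here is a minimal sketch. Only the $99 per core per year figure comes from the text above. The function names, the 16-blade enclosure and the dual-socket layout are illustrative assumptions, and no Oracle per-socket price is assumed (it is a required parameter).

```python
# Back-of-the-envelope licence arithmetic. Only the $99 figure is from the
# text; every other number below is a hypothetical illustration.
PRICE_PER_CORE_USD = 99  # per core per year, as quoted above


def per_core_cost(nodes, cores_per_node, price_per_core=PRICE_PER_CORE_USD):
    """Annual cost when every core in the cluster is licensed."""
    return nodes * cores_per_node * price_per_core


def per_socket_cost(nodes, sockets_per_node, price_per_socket):
    """Annual cost when every socket (CPU) is licensed.

    price_per_socket has no default on purpose: the source does not
    state what Oracle charged per socket.
    """
    return nodes * sockets_per_node * price_per_socket


def break_even_socket_price(cores_per_socket, price_per_core=PRICE_PER_CORE_USD):
    """Per-socket price at which both licensing schemes cost the same."""
    return cores_per_socket * price_per_core


if __name__ == "__main__":
    # Assumed enclosure: 16 blades x 20 cores = 320 cores.
    print(per_core_cost(16, 20))        # 31680
    # Assumed dual-socket blades, i.e. 10 cores per socket.
    print(break_even_socket_price(10))  # 990
```

Under these assumptions, any per-socket fee below $990 a year would have been cheaper than the per-core scheme, and the gap widens as core counts per CPU grow.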

For large enterprises with complex computational clusters, the proprietary version marketed by Univa does not currently show much promise in comparison with 6.2u5, but it is the only game in town with tech support. That support is at the regular industry level, which is to say that it is pretty mediocre.

Attempt to market open source version with paid support failed

A team of contributors to the Sun open source version of SGE tried to produce a version on similar conditions called Open Grid Scheduler (http://gridscheduler.sourceforge.net/):

Baseline code comes from the last Oracle open source release with significant additional enhancements and improvements added. The maintainer(s) have deep knowledge of SGE source and internals and are committed to the effort. No pre-compiled "courtesy binaries" available at the SourceForge site (just source code and instructions on how to build Grid Engine locally). In November 2011 a new company ScalableLogic announced plans to offer commercial support options for users of Open Grid Scheduler. Support: Supported via the maintainers and the users mailing list. Commercial support from ScalableLogic.

Their attempt failed. Unlike Linux, Apache, MySQL and PHP, the open source version failed to provide a substantial support revenue stream.

Dave Love developed his Son of Grid Engine version of SGE

Initially the Univa SGE team planned to follow the Sun path and produce both a "core" open source version and an "enhanced" professional paid version. They started along this path by posting the open source version 8.0.0 of Univa Grid Engine. But with the change of CEO, who came from IBM, they changed their mind and switched to a fully commercial software development path. The open source version soon disappeared from the Web.

Dave Love decided to save it and started development of his own SGE distribution, which he aptly named Son of Grid Engine. On July 27, 2011 he posted on the Grid Engine mailing list that he had compiled the Univa 8.0.0-tagged GitHub code, but without ARCo. It can still be found at: http://arc.liv.ac.uk/downloads/SGE/releases/8.0.0a/. It doesn't contain the GUI installer and has no support for Hadoop or the Java interface (libjgdi) in all versions. Architectures are Linux as well as Intel-based Solaris.

That's how his project, which as of November 2014 is more than three years old, was born. Here is how he described his efforts to preserve and enhance the open source version of SGE:

The Son of Grid Engine is a community project to continue Sun's old gridengine free software project that used to live at http://gridengine.sunsource.net after Oracle shut down the site and stopped contributing code. (Univa now own the copyright; see below.) It will maintain copies of as much as possible/useful from the old site.

The idea is to encourage sharing, in the spirit of the original project, informed by long experience of free software projects and scientific computing support. Please contribute, and share code or ideas for improvement, especially any ideas for encouraging contribution.

This effort precedes Univa taking over gridengine maintenance and subsequently apparently making it entirely proprietary, rather than the originally-promised ‘open core’. What's here was originally based on Univa's free code and was intended to be fed into that.

ADMIN Magazine (2012):

Univa built on top of Grid Engine 6.2U5 (the last open version from Sun) and released Univa Grid Engine as open source; however, according to Univa, releases will lag between product, currently 8.1, and source, currently 8.0. Univa actively adds features for customers as required and emphasizes that most of the core scheduler work was done by the Sun engineers that now work at Univa, whereas the open community that formed was fundamentally focused on usability and configuration. Thus, versions of Univa Grid Engine will focus on production-worthy status, ensure future development, and deliver rapid turnaround with user issues without relying on community resources.

Son Of Grid Engine

The final Grid Engine implementation is cleverly called “Son of Grid Engine” and started in fall 2010 when it was clear Oracle was not contributing to the gridengine.sunsource.net  site. Son of Grid Engine is a community-supported continuation of the old Sun Grid Engine project. As much of the original information as possible has been preserved, including an active repository for the project (The Open Grid Scheduler forked using a snapshot of the last Sun source tree). Additionally, Son of Grid Engine has collected much useful information from the original project, including the valuable how-tos, as well as active repository and mailing list archives.

The current version, 8.1.1, is based on Univa (version 8) and has incorporated community changes not in the Univa tree. The intention is to be an enhanced superset of the Univa public repository tracking Univa source releases. (Note that the version numbers have diverged, in that Son of Grid Engine version 8.1.1 is not based on the Univa 8.1 source tree.) Active development is evidenced by the project timeline and many releases.

Son of Grid Engine will use any code that looks useful, correct, tractable, and legal, including changes from Open Grid Scheduler and some of the packaging code from Debian and Fedora. (RPMs are available). It is intended to be free software supported by a community of fellow system managers, users, and contributions. They also want to point out that there is more than just the SGE source available in their repository, including other components such as ARCo (Accounting and Reporting Console), Inspect (Monitoring and Configuration Console), and SDM (Service Domain Manager for Grid Engine services adding cloud connectivity).

Current status of open source versions

The fact that Sun open sourced the product was extremely fortunate, and it got more traction than, say, open source Solaris, which quietly died after Oracle bought Sun. The classic Sun 6.2u5 version is still the most popular and is reasonably reliable. Oracle version 6.2u7, which is now abandonware, is also floating around, although its status is unclear. It can be used by former customers with a perpetual license, as it represents a somewhat better implementation than Univa SGE 8.1.7.

Only one reliable patched open source version that is a substantial and valuable derivative of Sun 6.2u5 currently exists: Son of Grid Engine 8.1.8 (as of November 2014). Several other attempts to continue enhancing the Sun open source version were launched and died. As of the end of 2014 Son of Grid Engine is still actively maintained, but like many open source projects it is a one-man game and it is unclear how long Dave Love will last.

It is somewhat strange that although the product is targeted at the research community and computer centers with supercomputers, which usually employ a lot of talented programmers (and some computational chemists I know can give any professional programmer a run for the money ;-), it failed to attract a strong development team. As has happened many times before, even the limited manpower available for further development of SGE as an open source project unfortunately did not achieve synergy. Each small team wanted to scratch its own itch. As a result, the Sun version 6.2u5 codebase was forked in somewhat different directions by several independent groups (see FAQ gridengine.org). Other than Son of Grid Engine there were two notable efforts:

In any case, due to the decision of the Sun brass to open source the codebase, Linux got a reliable, enterprise-class, free, open source batch scheduler, which is included with several Linux distributions (such as Debian). There are interesting open source versions, such as Son of Grid Engine, stemming from the last open source version released by Sun.

There were also two commercial, binary-only versions of SGE: Oracle Grid Engine and Univa Grid Engine.

See SGE implementations for more information.

Dave Love remains the only surviving open source warrior on an SGE battlefield with many casualties

As of November 2014 Son of Grid Engine is the only "datacenter class" open source implementation that is still under active development. Dave Love, the maintainer of this version, also created a very useful web site (https://arc.liv.ac.uk/trac/SGE) with the "remnants" of Sun documents, open source software and other materials mercilessly discarded by Oracle. It is such a pity that the Oracle people proved to be such clueless vandals.

Technical difficulties and the lack of funding were not the only problems facing Dave Love. Univa considered him a threat to their revenue stream and tried to suppress his effort ([gridengine users] Dave Love's Copyright Infringement):

Ron Chen ron_chen_123 at yahoo.com
Thu Apr 12 17:54:53 UTC 2012

I talked to an independent source recently, who told me that Grid Engine is not
"safe". (He said Grid Engine as a whole is not legally safe).

So this is 6 months later, has the copyright infringement been fixed yet?

(I was away for a few months to handle a high priority contract, so I may have
missed something discussed here in the past few months).

-Ron

----- Original Message -----
From: Dave Love <d.love at liverpool.ac.uk>
To: users at gridengine.org
Cc: sge-discuss at liverpool.ac.uk
Sent: Wednesday, November 9, 2011 1:14 PM
Subject: [gridengine users] Beware Univa FUD

I've been told that Univa have been giving customers a "health warning"
to avoid Son of Grid Engine, the community distribution. They claim I'm
distributing Sun proprietary material, and am lucky Oracle lawyers
haven't descended yet.

He withstood this attack.

Note that, contrary to multiple claims, there is no evidence that any Sun (or other) material here is being distributed without an appropriate licence allowing it. See the rebuttal. Reports of licensing bugs are very welcome, of course.

As of version 8.1.8 it is the best debugged open source distribution. It is also very attractive for those who have experience with building software. Installation is pretty raw, but I tried to compensate for that by creating several pages which together document the installation process on RHEL 6.5 or 6.6 pretty well:

With the current size of the codebase he is engaged in a very difficult, exhausting battle. We all need to support his efforts. Son of Grid Engine is essentially a one-man project, which is not enough manpower for such complex software. Unfortunately, despite the fact that many labs use the open source version, no strong team of researchers supporting its development ever emerged. And even the existing efforts splintered the codebase (see the story of Open Grid Scheduler vs ‘Son of Grid Engine’).

In any case we can state that the lack of manpower and the lack of support revenue from the open source version of SGE hurt attempts to develop viable derivatives of the classic version 6.2u5, and Dave Love was and still is the only guy who successfully swims against the stream.

Instead of Conclusion

Like any software development, the development of Grid engine is essentially a battle of ideas. And this is one of those cases when open source software managed to withstand and survive attempts of commercial developers to monopolize the field for themselves. 

We owe much gratitude to the Sun brass, which opened the codebase and released a pretty advanced version (version 6.2u5) as open source. It looks like an open source release has better chances to survive if it is sufficiently advanced and has a certain (critical) number of customers. But which project survives and which dies depends on particular circumstances.

We also owe much gratitude to Dave Love, who managed to launch and sustain his open source project, Son of Grid Engine, which is now at version 8.1.8. I wonder if Debian maintainers could serve as a quality assurance team for this product. This would be a valuable change.

See SGE implementations for more details about SGE development and the issues of selection of the right version of SGE. There is an active SGE users list at gridengine.org site.

Among similar open source products the most prominent are Maui Scheduler and OpenPBS (commercial version is called PBS Pro).



Old News ;-)

[Feb 08, 2017] Sge,Torque, Pbs WhatS The Best Choise For A Ngs Dedicated Cluster

Feb 08, 2017 | www.biostars.org
Question: Sge,Torque, Pbs : What'S The Best Choise For A Ngs Dedicated Cluster ? Asked 4.4 years ago; abihouee wrote:

Sorry, it may be off topics...

We plan to install a scheduler on our cluster (DELL blade cluster over Infiniband storage on Linux CentOS 6.3). This cluster is dedicated to do NGS data analysis.

It seems to me that the most current is SGE, but since Oracle bought the stuff, there are several alternative developments (OpenGridEngine, SonGridEngine, Univa Grid Engine ...)

Another possible scheduler is Torque / PBS.

I'm a little bit lost in this scheduler forest! Is there someone with any experience of this or who knows of an existing benchmark?

Thanks a lot. Audrey

(written 4.4 years ago by abihouee)

I worked with SGE for years at a genome center in Vancouver. Seemed to work quite well. Now I'm at a different genome center and we are using LSF but considering switching to SGE, which is ironic because we are trying to transition from Oracle DB to PostGres to get away from Oracle... SGE and LSF seemed to offer similar functionality and performance as far as I can tell. Both clusters have several 1000 cpus.

(reply written 4.3 years ago by Malachi Griffith)

openlava ( source code ) is an open-source fork of LSF that while lacking some features does work fairly well.

(reply written 2.1 years ago by Malachi Griffith)

Torque is fine, and very well tested; either of the SGE forks are widely used in this sort of environment, and has qmake, which some people are very fond of. SLURM is another good possibility.

(reply written 2.1 years ago by Jonathan Dursi)

matted (Boston, United States) wrote 4.4 years ago:

I can only offer my personal experiences, with the caveat that we didn't do a ton of testing and so others may have differing opinions.

We use SGE, which installs relatively nicely on Ubuntu with the standard package manager (the gridengine-* packages). I'm not sure what the situation is on CentOS.

We previously used Torque/PBS, but the scheduler performance seemed poor and it bogged down with lots of jobs in the queue. When we switched to SGE, we didn't have any problems. This might be a configuration error on our part, though.

When I last tried out Condor (several years ago), installation was quite painful and I gave up. I believe it claims to work in a cross-platform environment, which might be interesting if for example you want to send jobs to Windows workstations.

LSF is another option, but I believe the licenses cost a lot.

My overall impression is that once you get a system running in your environment, they're mostly interchangeable (once you adapt your submission scripts a bit). The ease with which you can set them up does vary, however. If your situation calls for "advanced" usage (MPI integration, Kerberos authentication, strange network storage, job checkpointing, programmatic job submission with DRMAA, etc. etc.), you should check to see which packages seem to support your world the best.

(written 4.4 years ago by matted)

Recent versions of torque have improved a great deal for large numbers of jobs, but yes, that was a real problem.

I also agree that all are more or less fine once they're up and working, and the main way to decide which to use would be to either (a) just pick something future users are familiar with, or (b) pick some very specific things you want to be able to accomplish with the resource manager/scheduler and start finding out which best support those features/workflows.

(reply written 2.1 years ago by Jonathan Dursi)

Jeremy Leipzig (Philadelphia, PA) wrote 4.4 years ago:

Unlike PBS, SGE has qrsh, which is a command that actually runs jobs in the foreground, allowing you to easily inform a script when a job is done. What will they think of next?

This is one area where I think the support you pay for going commercial might be worthwhile. At least you'll have someone to field your complaints.

(written 4.4 years ago by Jeremy Leipzig, modified 2.1 years ago)

EDIT: Some versions of PBS also have qsub -W block=true, which works in a very similar way to SGE qrsh.

(reply written 4.4 years ago by Sean Davis)

you must have a newer version than me

>qsub -W block=true dothis.sh 
qsub: Undefined attribute  MSG=detected presence of an unknown attribute
>qsub --version
version: 2.4.11

(reply written 4.4 years ago by Jeremy Leipzig)

For Torque and perhaps versions of PBS without -W block=true, you can use the following two switches. The behaviour is similar but when called, any embedded options to qsub will be ignored. Also, stderr/stdout is sent to the shell.

qsub -I -x dothis.sh

(reply written 16 months ago by matt.demaere)

My answer should be updated to say that any DRMAA-compatible cluster engine is fine, though running jobs through DRMAA (e.g. Snakemake --drmaa ) instead of with a batch scheduler may anger your sysadmin, especially if they are not familiar with scientific computing standards.

using qsub -I just to get a exit code is not ok

(reply written 2.1 years ago by Jeremy Leipzig)
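
The exchange above turns on one practical need: submit a job, wait for it, and learn its exit status. As a minimal sketch in Python, assuming an SGE installation whose qsub supports the -sync y option that a later reply in this thread mentions, a pipeline step could wrap the call as shown below. The names build_sync_command and submit_and_wait, and the injectable runner parameter, are assumptions made for illustration and are not part of any scheduler API; dothis.sh is the script name used in the posts above.

```python
import subprocess
from typing import Callable, List


def build_sync_command(script: str) -> List[str]:
    """Build an SGE command line that blocks until the job has finished."""
    # With "-sync y", qsub waits for the job instead of returning immediately.
    return ["qsub", "-sync", "y", script]


def submit_and_wait(
    script: str,
    runner: Callable[..., "subprocess.CompletedProcess"] = subprocess.run,
) -> int:
    """Submit `script` and return the exit code reported by qsub.

    `runner` defaults to subprocess.run. It is injectable (an assumption of
    this sketch) so the logic can be exercised on a machine without SGE.
    """
    result = runner(build_sync_command(script))
    return result.returncode
```

On a real cluster, submit_and_wait("dothis.sh") would block until the job ends, and a non-zero return value can be used to stop the pipeline. Torque users would need a different command line, as the replies above discuss.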

Torque definitely allows interactive jobs -

qsub -I

As for Condor, I've never seen it used within a cluster; it was designed back in the day for farming out jobs between diverse resources (e.g., workstations after hours) and would have a lot of overhead for working within a homogeneous cluster. Scheduling jobs between clusters, maybe?

(reply written 2.1 years ago by Jonathan Dursi)

Ashutosh Pandey (Philadelphia) wrote 4.4 years ago:

We use Rocks Cluster Distribution that comes with SGE.

http://en.wikipedia.org/wiki/Rocks_Cluster_Distribution

(written 4.4 years ago by Ashutosh Pandey)

+1 Rocks - If you're setting up a dedicated cluster, it will save you a lot of time and pain.

(reply written 4.3 years ago by mike.thon)

I'm not a huge rocks fan personally, but one huge advantage, especially (but not only) if you have researchers who use XSEDE compute resources in the US, is that you can use the XSEDE campus bridging rocks rolls which bundle up a large number of relevant software packages as well as the cluster management stuff. That also means that you can directly use XSEDEs extensive training materials to help get the cluster's new users up to speed.

(reply written 2.1 years ago by Jonathan Dursi)

samsara (The Earth) wrote 4.3 years ago:

It has been more than a year i have been using SGE for processing NGS data. I have not experienced any problem with it. I am happy with it. I have not used any other scheduler except Slurm few times.

(written 4.3 years ago by samsara)

richard.deborja (Canada) wrote 2.1 years ago:

Used SGE at my old institute, currently using PBS and I really wish we had SGE on the new cluster. Things I miss the most, qmake and the "-sync y" qsub option. These two were completely pipeline savers. I also appreciated the integration of MPI with SGE. Not sure how well it works with PBS as we currently don't have it installed.

(written 2.1 years ago by richard.deborja)

joe.cornish826 (United States) wrote 2.1 years ago:

NIH's Biowulf system uses PBS, but most of my gripes about PBS are more about the typical user load. PBS always looks for the next smallest job, so your 30 node run that will take an hour can get stuck behind hundreds (and thousands) of single node jobs that take a few hours each. Other than that it seems to work well enough.

In my undergrad our cluster (UMBC Tara) uses SLURM, didn't have as many problems there but usage there was different, more nodes per user (82 nodes with ~100 users) and more MPI/etc based jobs. However, a grad student in my old lab did manage to crash the head nodes because we were rushing to rerun a ton of jobs two days before a conference. I think it was likely a result of the head node hardware and not SLURM. Made for a few good laughs.

(written 2.1 years ago by joe.cornish826)

"PBS always looks for the next smallest job" -- just so people know, that's not something inherent to PBS. That's a configurable choice the scheduler (probably maui in this case) makes, but you can easily configure the scheduler so that bigger jobs don't get starved out by little jobs that get "backfilled" into temporarily open slots.

(reply written 2.1 years ago by Jonathan Dursi)
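
The backfilling idea in the reply above can be made concrete with a small Python sketch. It is a simplified, EASY-style backfill written for illustration only and is not Maui's actual algorithm: the highest-priority blocked job gets a reservation at the earliest time enough nodes will be free, and a smaller job may jump ahead only if it fits in the currently idle nodes and finishes before that reservation. The Job class, the jobs_to_start function and the sample numbers are all assumptions of this sketch.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Job:
    name: str
    nodes: int       # nodes requested
    runtime: float   # requested wall time, in hours


def jobs_to_start(total_nodes: int,
                  running: List[Tuple[float, int]],
                  queue: List[Job],
                  now: float = 0.0) -> List[str]:
    """Return the names of queued jobs that may start at time `now`.

    `running` holds (end_time, nodes) pairs; `queue` is in priority order.
    """
    free = total_nodes - sum(nodes for _, nodes in running)
    pending = list(queue)
    started = []

    # 1. Start jobs strictly in priority order for as long as they fit.
    while pending and pending[0].nodes <= free:
        job = pending.pop(0)
        free -= job.nodes
        running = running + [(now + job.runtime, job.nodes)]
        started.append(job.name)
    if not pending:
        return started

    # 2. The head job is blocked: find the time at which it can start.
    head = pending[0]
    available, reservation = free, None
    for end_time, nodes in sorted(running):
        available += nodes
        if available >= head.nodes:
            reservation = end_time
            break
    if reservation is None:  # the head job is larger than the cluster
        return started

    # 3. Backfill: later jobs may run now only if they cannot delay the head.
    for job in pending[1:]:
        if job.nodes <= free and now + job.runtime <= reservation:
            free -= job.nodes
            started.append(job.name)
    return started
```

For example, on a 4-node cluster with 3 nodes busy for another 2 hours, a queued 4-node job reserves hour 2. A 1-node, 1-hour job is backfilled into the idle node, while a 1-node, 5-hour job is held back because it would push the big job past its reservation. Without step 2, small jobs could keep the big job waiting indefinitely, which is the starvation described in the post.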

Part of it is because Biowulf looks for the next smallest job but also prioritizes by how much cpu time a user has been consuming. If I've run 5 jobs with 30x 24 core nodes each taking 2 hours of wall time, I've used roughly 3600 CPU hours. If someone is using a single core on each node (simple because of memory requirements), they're basically at a 1:1 ratio between wall and cpu time. It will take a while for their CPU hours to catch up to mine.

It is a pain, but unlike math/physics/etc there are fewer programs in bioinformatics that make use of message passing (and when they do, they don't always need low-latency ICs), so it makes more sense to have PBS work for the generic case. This behavior is mostly seen on the ethernet IC nodes, there's a much smaller (245 nodes) system set up with infiniband for jobs that really need it (e.g. MrBayes, structural stuff).

Still I wish they'd try and strike a better balance. I'm guilty of it but it stinks when the queue gets clogged with memory intensive python/perl/R scripts that probably wouldn't need so much memory if they were written in C/C++/etc.
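
The fair-share arithmetic in this post can be worked through directly. The Python sketch below assumes usage is charged as jobs × nodes × cores per node × wall-clock hours; that is an assumption, since the post does not say how Biowulf accounts for usage, and the function name core_hours is hypothetical. Under that reading the figures in the post come to 7200 core-hours rather than the "roughly 3600" quoted, so the poster may have been counting differently (3600 would match one-hour jobs, for instance). The qualitative point is unchanged.

```python
def core_hours(jobs: int, nodes_per_job: int, cores_per_node: int,
               wall_hours: float) -> float:
    """Usage charged to a user, assuming every allocated core is billed."""
    return jobs * nodes_per_job * cores_per_node * wall_hours


# The parallel user from the post: 5 jobs, 30 nodes of 24 cores, 2 hours each.
parallel_usage = core_hours(jobs=5, nodes_per_job=30, cores_per_node=24,
                            wall_hours=2)

# A single-core job of the same length accrues usage at a 1:1 ratio.
serial_usage_per_job = core_hours(jobs=1, nodes_per_job=1, cores_per_node=1,
                                  wall_hours=2)

# Number of such single-core jobs needed to accumulate the same usage.
serial_jobs_to_catch_up = parallel_usage / serial_usage_per_job

print(parallel_usage, serial_usage_per_job, serial_jobs_to_catch_up)
```

This is why the single-core users stay ahead in priority for so long: each of their jobs adds very little to the usage total that the scheduler weighs against them.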

[May 07, 2012] The memories of a Product Manager The True Story of the Grid Engine Dream

April 25, 2012
Grid Engine started with an extraordinary entrepreneur, Dr Wolfgang Gentzsch, who founded Genias Software in 1991, later renamed Gridware in 1999.

Wolfgang says there is only one Grid Engine community, which forms an ecosystem, which he calls a symbiosis of diversity. It all originated in Genias. It implied a huge physical and mental effort, going through Sun's acquisition of Gridware in 2000 and later - when Oracle took over Sun in 2011 the creation of the GE ecosystem.

After Wolfgang left Sun (many fine people at Sun had to leave at that time), it was frustrating to see how our efforts to have two Sun Grid Engine products (one available by subscription and one available as free Open Source) failed because of management veto. On one hand we were under pressure to be profitable as a unit; on the other hand, our customers appeared to have no reason to pay even one cent for a subscription or license.

Oracle still has IP control of Grid Engine. Both Univa and Oracle decided to make no more contributions to the open source. While in Oracle open source policies are clear, Univa, a champion of open source for many years, has surprised the community. This has created an agitated thread on the Grid Engine discussion group.

Quoting from Inc. again:

Extraordinary bosses see change as an inevitable part of life. While they don't value change for its own sake, they know that success is only possible if employees and organization embrace new ideas and new ways of doing business.

The Open Source Grid Engine Blog Grid Engine cgroups Integration

May 22, 2012 | blogs.scalablelogic.com
The PDC (Portable Data Collector) in Grid Engine's job execution daemon tracks job-process membership for resource usage accounting purposes, for job control purposes (ie. making sure that jobs don't exceed their resource limits), and for signaling purposes (eg. stopping, killing jobs).

Since most operating systems don't have a mechanism to group processes into jobs, Grid Engine adds an additional Group ID to each job. As normal processes can't change their GID membership, it is a safe way to tag processes to jobs. On operating systems where the PDC module is enabled, every so often the execution daemon scans all the processes running on the system, and then groups processes to jobs by looking for the additional GID tag.
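
The tagging scheme described above can be illustrated with a short Python sketch. It is not the PDC implementation; it only assumes the Linux /proc/<pid>/status format, whose "Groups:" line lists a process's supplementary group IDs, and a hypothetical job GID range of 20000-20100 (the real range is a site setting). The function names are likewise made up for this example.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def supplementary_gids(status_text: str) -> List[int]:
    """Extract supplementary GIDs from the text of /proc/<pid>/status."""
    for line in status_text.splitlines():
        if line.startswith("Groups:"):
            return [int(token) for token in line.split()[1:]]
    return []


def group_pids_by_job_gid(processes: Iterable[Tuple[int, str]],
                          gid_range: range) -> Dict[int, List[int]]:
    """Map each job-tag GID to the PIDs that carry it.

    `processes` yields (pid, status_text) pairs; GIDs outside `gid_range`
    are ordinary groups and are ignored.
    """
    jobs: Dict[int, List[int]] = defaultdict(list)
    for pid, status_text in processes:
        for gid in supplementary_gids(status_text):
            if gid in gid_range:
                jobs[gid].append(pid)
    return dict(jobs)


# Synthetic status snippets standing in for a scan of /proc.
sample = [
    (10, "Name:\tsleep\nGroups:\t100 20017 \n"),
    (11, "Name:\tbash\nGroups:\t100 \n"),
    (12, "Name:\tpython\nGroups:\t100 20017\n"),
]
print(group_pids_by_job_gid(sample, range(20000, 20101)))
```

A real collector would have to read every /proc/<pid>/status file on each pass, which gives a feel for why frequent scans are expensive.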

So far so good, but...

Adding an extra GID has side-effects. We have received reports that applications behave strangely with an unresolvable GID. For example, on Ubuntu, we get:

$ qrsh

groups: cannot find name for group ID 20017

Another problem: it takes time for the PDC to warm up. For some short running jobs, you will find:

removing unreferenced job 64623.394 without job report from ptf

The third problem is that if the PDC runs too often, it takes too much CPU time. In SGE 6.2 u5, a memory accounting bug was introduced because the Grid Engine developers needed to reduce the CPU usage of the PDC on Linux by adding a workaround. (Shameless plug: we the Open Grid Scheduler developers fixed the bug back in 2010, way ahead of any other Grid Engine implementations that are still active these days.) Imagine running ps -elf every second on your execution nodes. This is how intrusive the PDC is!

The final major issue is that the PDC is not accurate. Grid Engine itself does not trust the information from the PDC at job cleanup. The end result is run-away jobs consuming resources on the execution hosts. The cluster administrators then need to enable the special flag to tell Grid Engine to do proper job cleanup (by default ENABLE_ADDGRP_KILL is off). Quoting the Grid Engine sge_conf manpage:

ENABLE_ADDGRP_KILL

If this parameter is set then Sun Grid Engine uses the supplementary group ids (see gid_range) to identify all processes which are to be terminated when a job is deleted, or when sge_shepherd(8) cleans up after job termination.

Grid Engine cgroups Integration

In Grid Engine 2011.11 update 1, we switch to cgroups instead of the additional GID for the process tagging mechanism.

(We the Open Grid Scheduler / Grid Engine developers wrote the PDC code for AIX, HP-UX, and the initial PDC code for MacOS X, which is used as the base for the FreeBSD and NetBSD PDC. We even wrote a PDC prototype for Linux that does not rely on GID. Our code was contributed to Sun Microsystems, and is used in every implementation of Grid Engine - whether it is commercial, or open source, or commercial open source like Open Grid Scheduler.)

As almost half of the PDCs were developed by us, we knew all the issues in PDC.

We are switching to cgroups now but not earlier because:

  1. Most Linux distributions ship kernels that have cgroups support.
  2. We are seeing more and more cgroups improvements. Lots of cgroups performance issues were fixed in recent Linux kernels.

With the cgroups integration in Grid Engine 2011.11 update 1, all the PDC issues mentioned above are handled. Further, we have bonus features with cgroups:

  1. Accurate memory usage accounting: ie. shared pages are accounted correctly.
  2. Resource limit at the job level, not at the individual process level.
  3. Out of the box SSH integration.
  4. RSS (real memory) limit: we all have jobs that try to use every single byte of memory, but capping their RSS does not hurt their performance. May as well cap the RSS such that we can take back the spare processors for other jobs.
  5. With the cpuset cgroup controller, Grid Engine can set the processor binding and memory locality reliably. Note that jobs that change their own processor binding are not handled by the original Grid Engine Processor Binding with hwloc (Another shameless plug: we are again the first who switched to hwloc for processor binding) - it is very rare to encounter jobs that change their own processor binding, but if a job or external process decides to change its own processor mask, then this will affect other jobs running on the system.
  6. Finally, with the freezer controller, we can have a safe mechanism for stopping and resuming jobs:
$ qstat
job-ID  prior    name   user      state  submit/start at      queue         slots  ja-task-ID
---------------------------------------------------------------------------------------------
    16  0.55500  sleep  sgeadmin  r      05/07/2012 05:44:12  all.q@master  1
$ cat /cgroups/cpu_and_memory/gridengine/Job16.1/freezer.state
THAWED
$ qmod -sj 16
sgeadmin - suspended job 16
$ cat /cgroups/cpu_and_memory/gridengine/Job16.1/freezer.state
FROZEN
$ qmod -usj 16
sgeadmin - unsuspended job 16
$ cat /cgroups/cpu_and_memory/gridengine/Job16.1/freezer.state
THAWED
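
The transcript above reads the freezer state by hand. A monitoring script could do the same check programmatically; the Python sketch below is a hedged illustration that assumes the cgroup v1 freezer interface, in which a freezer.state file holds THAWED, FREEZING or FROZEN. The function names are hypothetical, and the demo uses a temporary directory as a stand-in for a job directory such as the /cgroups/cpu_and_memory/gridengine/Job16.1 path shown above.

```python
import tempfile
from pathlib import Path
from typing import List

VALID_STATES = {"THAWED", "FREEZING", "FROZEN"}


def read_freezer_state(cgroup_dir) -> str:
    """Return the freezer state recorded for one job's cgroup directory."""
    state = (Path(cgroup_dir) / "freezer.state").read_text().strip()
    if state not in VALID_STATES:
        raise ValueError(f"unexpected freezer state: {state!r}")
    return state


def is_suspended(cgroup_dir) -> bool:
    """True once every task in the cgroup has actually been frozen."""
    return read_freezer_state(cgroup_dir) == "FROZEN"


def demo_with_fake_cgroup() -> List[str]:
    """Exercise the reader against a temporary directory, not a real cgroup."""
    seen = []
    with tempfile.TemporaryDirectory() as tmp:
        state_file = Path(tmp) / "freezer.state"
        for value in ("THAWED", "FROZEN"):
            state_file.write_text(value + "\n")
            seen.append(read_freezer_state(tmp))
    return seen
```

On a host configured like the one in the transcript, is_suspended("/cgroups/cpu_and_memory/gridengine/Job16.1") would be expected to return True after qmod -sj 16 and False again after qmod -usj 16.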

We will be announcing more new features in Grid Engine 2011.11 update 1 here on this blog. Stay tuned for our announcement.

Which Grid Engine by Chris Dagdigian

February 15, 2012 | Bio-IT World

More than a year ago, Oracle made a decision that while not unexpected within the HPC community was nonetheless met with no small measure of concern. In December 2010, Oracle announced that Grid Engine (a very popular life science cluster scheduler and distributed resource manager that Oracle inherited via its purchase of Sun Microsystems) would no longer be freely available as an open-source product.

Oracle's decision to make Grid Engine available only to commercially licensed customers left a large community of scientific and high performance computing users questioning the viability of their long term technical planning and HPC roadmaps.

Life science users in particular were affected by the announcement as Grid Engine has become the de-facto standard in many bioinformatics, chemistry and genomics computing environments. The Grid Engine "standard" is pervasive enough that even laboratory instrument vendors target it to address customer concerns with integration into existing enterprise environments.

Is Grid Engine still relevant?

Back in 2010 I was less concerned with the future of resource management software as I was knee deep in several cloud projects and quickly saw firsthand how IaaS cloud platforms upend traditional scientific computing environments. Grid Engine enables multiple users, projects and groups to share the same infrastructure effectively. Why would this even be needed "on the cloud" where dynamically provisioning perfectly sized resources and infrastructure on a per-user or per-workflow basis is so trivial?

I was clearly incorrect. Grid Engine and similar software packages are still being actively deployed on cloud platforms. The ability to replicate "legacy" research computing environments is turning out to be a "must have" cloud capability. Grid Engine even has a place in the "new" style of cloud deployment architectures: it turns out that a scheduling and resource allocation system is very handy for organizations that prefer to keep some amount of their infrastructure persistently running and instantly reconfigurable.

In 2011, my work shifted to a number of infrastructure and datacenter refresh projects with biotech and pharmaceutical customers. This is when I realized that the future of Grid Engine was very much still a pressing concern. By mid-2011 organizations that had simply kept on using existing versions of Grid Engine were beginning to think about "what next". Even groups that were totally happy with existing scientific computing environment were starting to plan for the future as major technology advances in multi-core CPU management and GPU computing needed to be reflected in the capabilities of their job schedulers and resource allocation engines. A static, old, or unchanging Grid Engine environment will not be able to handle major advances in HPC, CPU, and GPU computing technologies.

Grid Engine in 2012

Immediately after the Oracle announcement in December 2010, multiple people announced intent to "fork" the last available open-source codebase that Oracle had released, or take that code, and begin independent development of it, creating a distinct piece of software. Of the forks that were created, there were two in particular led by individuals with deep familiarity with Grid Engine internals. This provided significant comfort to people interested in the long term viability of the product, as forking the code is simply not enough: viability depends on people with deep prior experience with the complex codebase.

Another major event occurred in January 2011 when Univa announced that it had recruited a number of Oracle employees, including key members of the Grid Engine development and product management team. These members would now be working on a new version of Grid Engine sold and supported by Univa. Seeing an additional commercial company actively investing in and supporting the future of Grid Engine was the final piece I needed to be personally convinced that Grid Engine still had a future.

So where are we in 2012?

In a pretty good position, actually. Grid Engine users now have two sources for commercially licensed and commercially supported products: both Oracle and Univa supply these. Free software fans and other related open-source projects that depend upon access to an unrestricted resource manager also have two different projects from which to choose.

Even better, a new company called Scalable Logic has announced its intent to provide commercial support and consulting services for one of the free Grid Engine variants. The ability to buy a support contract or even per-incident assistance for a free version of Grid Engine closes my last personal "must-have" feature wishlist.

This is how I have been handling the "what next?" discussions in my own projects:

I'd be interested in hearing your stories. Have you switched from Grid Engine? To what? What other options are people looking at? I know within BioTeam, OpenLava is the next resource manager on our internal list of "to-try" items in the lab.

Available Grid Engine Options

Free & Open Source

Son of Grid Engine
URL: https://arc.liv.ac.uk/trac/SGE
News & Announcements: http://arc.liv.ac.uk/repos/darcs/sge/NEWS
Description: Baseline code comes from the Univa public repo with additional enhancements and improvements added. The maintainer(s) have deep knowledge of SGE source and internals and are committed to the effort. Future releases may start to diverge from Univa as Univa pursues an "open core" development model. Maintainers have made efforts to make building binaries from source easier and the latest release offers RedHat Linux SRPMS and RPM files ready for download.
Support: Supported via the maintainers and the users mailing list.

Open Grid Scheduler
URL: http://gridscheduler.sourceforge.net/
Description: Baseline code comes from the last Oracle open source release with significant additional enhancements and improvements added. The maintainer(s) have deep knowledge of SGE source and internals and are committed to the effort. No pre-compiled "courtesy binaries" available at the SourceForge site (just source code and instructions on how to build Grid Engine locally). In November 2011 a new company, ScalableLogic, announced plans to offer commercial support options for users of Open Grid Scheduler.
Support: Supported via the maintainers and the users mailing list. Commercial support from ScalableLogic.

Commercially Supported & Licensed

Univa Grid Engine
URL: http://www.univa.com/products/grid-engine
Description: Commercial company selling Grid Engine, support and layered products that add features and functionality. Several original SGE developers are now employed by Univa. Evaluation versions and "48 cores for free" are available from the website.
Support: Univa supports its own products.

Oracle Grid Engine
URL: http://www.oracle.com/us/products/tools/oracle-grid-engine-075549.html
Description: Continuation of "Sun Grid Engine" after Oracle purchased Sun. This is the current commercial version of Oracle Grid Engine after Oracle discontinued the open source version of their product and went 100% closed-source.
Support: Oracle supports their own products; a web support forum for Oracle customers can be found at https://forums.oracle.com/forums/forum.jspa?forumID=859

Chris Dagdigian is a consultant with the BioTeam.

Open Grid Engine Schedules a New Path By Douglas Eadline, Ph.D.

February 8, 2011 | Linux Magazine

One of the open source refugees from the Oracle/Sun takeover finds a welcome home.

Back in September, I wrote about some of the changes with Sun Grid Engine (SGE). Basically, there was some concern due to the release of the version 6.2u6 product binaries, but no corresponding update to the open source tree. The open source site had no updates and was still at version 6.2u5. Obviously, the recent acquisition of Sun by Oracle added further concern. In particular, many believed Oracle was going to kick open SGE to the street. There have been some changes, but no kicking has been noted.

First, let's have a look at what Oracle had to say, as posted on the gridengine.sunsource.net mailing list by Dan Templeton on December 24. The list has not had any traffic since January 2 and is presumed closed. Here are the highlights:

Today, we are entering a new chapter in Oracle Grid Engine's life. Oracle has been working with key members of the open source community to pass on the torch for maintaining the open source code base to the Open Grid Scheduler project hosted on SourceForge. This transition will allow the Oracle Grid Engine engineering team to focus their efforts more directly on enhancing the product. In a matter of days, we will take definitive steps in order to roll out this transition. To ensure on-going communication with the open source community, we will provide the following services:

Not exactly kicked to the street, but clear that you can't come back inside if it starts raining. Oracle seems to be trying to do the right thing and they will continue to develop and sell a version of Oracle Grid Engine.

As the dust continues to settle, the good news is SGE had open source insurance and as such will not vanish even when companies do. The first big announcement is from Univa, which has hired the principal engineers from the Sun/Oracle Grid Engine team, including Grid Engine founder and original project owner Fritz Ferstl. Univa will concentrate on improving Grid Engine for technical computing and HPC and promote the continuity of the Grid Engine open source community. Good news indeed. For a good overview and some history check out this article by one of the original developers, Wolfgang Gentzsch.

There are still a few things that need to be worked out. There seem to be two code repositories. The first is Son Of Grid Engine set up by Dave Love. This site has plenty of information and includes important links. It also has instructions on how to build the code. The second site, as mentioned by Oracle, is the Open Grid Scheduler project. This site is mostly a code dump and was set up by many of the community coders who worked on SGE in the past.

There are other interested parties as well and there is an ongoing discussion on how to proceed with two code bases. The next big announcement was the emergence of GridEngine.org. This site has more background and information including the mention of a Steering Committee comprised of the following people:

These are the right people for the job and I expect that before too long a unified code base called Open Son of Grid Engine Scheduler (or some more suitable name) will surface. At that point I expect a great future for all those involved. There is also a new mailing list that is worth joining.

Of course, you can still buy Oracle Grid Engine (OGE) if you so desire. With all the interest Oracle has shown in HPC, I can't see a down side to that choice. Seriously, Oracle has plans for OGE which don't involve HPC and now that Univa, The BioTeam, and Bad Dog Consulting have stepped up to the plate, I'm sure the HPC crowd will know who to call.

Grid Engine Running on All Four Cylinders by Douglas Eadline

2012 | ADMIN Magazine

In 2010, Oracle's purchase of Sun Microsystems ended an era of technology leadership. Such acquisitions preserve intellectual property and some key individuals, but a large part of the personality and passion often spreads in the technology wind. HPC users also had many questions about the acquisition. Aside from the shivers of fear sent down the MySQL community, two Sun open software projects are used in the HPC arena. The first is the Lustre parallel filesystem, for which Oracle dropped future support and development and which has since been picked up by several companies working under a GPLv2 license.

The second is Sun Grid Engine. Unlike Lustre, Oracle still offers what is now called Oracle Grid Engine to customers as a closed source product. The original Sun Grid Engine, previously known as CODINE (COmputing in DIstributed Networked Environments) or GRD (Global Resource Director), came to Sun through the purchase of Gridware Inc. in 2000. After renaming it Sun Grid Engine, Sun offered the package with source code in 2001 and also sold a commercial version called N1 Grid Engine (N1GE).

The open source license used by Sun, called the Sun Industry Standards Source License (SISSL), is now a retired free and open source license. It was recognized as an "open license" by the Free Software Foundation and the Open Source Initiative (OSI). The license is somewhat interesting. Under SISSL, developers could modify and distribute source code and derived binaries freely. Modifications could be kept private or made public; however, SISSL required that "The Modifications which You create must comply with all requirements set out by the Standards body in effect one hundred twenty (120) days before You ship the Contributor Version." If the Modifications do not comply with the current standards, SISSL becomes a copyleft license, and source must be published "under the same terms as this license [SISSL] on a royalty free basis within thirty (30) days." Thus, as long as shipped binaries are standards compliant, there is no requirement to ship source code. The latest official released source code from Sun was version 6.2 Update 5.

One of the more interesting aspects of HPC is the use of open source for much of the cluster "plumbing" or infrastructure. When a package suddenly undergoes a change in ownership, the future openness and availability is often of some concern. In the two years since Oracle's purchase, four major efforts have come to the fore:
  * Oracle Grid Engine
  * Open Grid Scheduler
  * Univa Grid Engine
  * Son of Grid Engine

To get a sense of where each of these products/projects fits into the HPC landscape, I contacted each group and asked some questions about features, codebases, and the future. The first and easiest is Oracle Grid Engine.

Oracle Grid Engine

Having no direct contact person at Oracle, I attempted to email their sales channels asking for someone to whom I could put some questions. I have not received any response. Before you bring out the pitchforks and clubs, understand that Oracle has stated HPC is not a market in which they are interested. The Oracle Grid Engine Support page has plenty of information, and a 90-day free trial is purportedly available for those who register. It appears that Oracle Grid Engine is under active development and support but not targeted at HPC. It is not clear whether Oracle is adding their own features, integrating some of those found in the open versions (discussed below), or both. Also note that some of the Sun documentation is available on Oracle's website. In 2010, after the purchase of Sun, new Grid Engine 6.2 update 6 binaries were released without the corresponding source code.

Open Grid Scheduler

Many users might not know, but Oracle worked hard to do a smooth hand-off to the open source community. The Open Grid Scheduler project was chosen by Oracle in 2010 as the open source Grid Engine maintainer. Members of the project who were not employees of Sun but had been contributing code to Sun Grid Engine since 2001 formed a company called Scalable Logic to support the open version of Grid Engine.

The Scalable Logic team continues an open Grid Engine development effort. The team at Scalable Logic plans a feature release once a year combined with update releases. One of the first major enhancements was the inclusion of hwloc (hardware locality) multicore binding.

In the upcoming release, they are also adding the following new features:
##Cgroups (control groups) – a Linux kernel feature to limit, account for, and isolate resource usage (CPU, memory, disk I/O, etc.) of process groups, offered as an alternative to the Grid Engine Portable Data Collector.
##AMD Optimizations – support for the new Bulldozer architecture.
##Intel Xeon Phi – support for Intel MIC when Intel releases the new architecture.

Additionally, they had a bug fix for Cygwin support, support for Linux 3.0 on ARM, and the much-appreciated removal of NFSv4 dependency on BerkeleyDB spooling. They also merged fixes and features from end users, such as the updated Hadoop integration. Finally, the Scalable Logic team runs and contributes heavily to the Grid Engine Users mailing list, which is similar to the once popular [email protected] mailing list.

Scalable Logic also works closely with hardware vendors such as NVidia to include things such as GPU monitoring. Because some of the hardware features contain code protected by non-disclosure agreements, they can't use an open development model. These features do end up in the open version, however. Clearly, the Open Grid Scheduler/Grid Engine team is pushing the project to new heights and leading the way with support and new features, many of which show up later in other Grid Engine implementations.

Univa Grid Engine

On January 18, 2011, it was announced that Univa had recruited several principal engineers from the former Sun Grid Engine team and that Univa would be developing and supporting their version of Grid Engine. Univa Grid Engine, in addition to being open source, offers enhanced testing and support. Univa was a joint developer with Sun for components of the Sun HPC software stack and an OEM of Sun Grid Engine. Univa has delivered three Grid Engine production releases and seven updates in the past year.

In terms of software development, Univa reports that development has been extremely active and well funded. Moreover, Univa has invested millions of dollars in infrastructure, development, QA/automated testing, and customer support. Univa has also released UniSight for reporting and analytics and UniCloud for dynamic application management and will be releasing License Orchestrator for beta in Fall 2012.

Univa built on top of Grid Engine 6.2U5 (the last open version from Sun) and released Univa Grid Engine as open source; however, according to Univa, source releases will lag product releases (the product is currently at 8.1, the source at 8.0). Univa actively adds features for customers as required and emphasizes that most of the core scheduler work was done by the Sun engineers who now work at Univa, whereas the open community that formed was fundamentally focused on usability and configuration. Thus, versions of Univa Grid Engine will focus on production-worthy status, ensure future development, and deliver rapid turnaround on user issues without relying on community resources.

Son Of Grid Engine

The final Grid Engine implementation is cleverly called "Son of Grid Engine" and started in fall 2010 when it was clear Oracle was not contributing to the gridengine.sunsource.net site. Son of Grid Engine is a community-supported continuation of the old Sun Grid Engine project. As much of the original information as possible has been preserved, including an active repository for the project (The Open Grid Scheduler forked using a snapshot of the last Sun source tree). Additionally, Son of Grid Engine has collected much useful information from the original project, including the valuable how-tos, as well as active repository and mailing list archives.

The current version, 8.1.1, is based on Univa (version 8) and has incorporated community changes not in the Univa tree. The intention is to be an enhanced superset of the Univa public repository, tracking Univa source releases. (Note that the version numbers have diverged, in that Son of Grid Engine version 8.1.1 is not based on the Univa 8.1 source tree.) Active development is evidenced by the project timeline and many releases.

Son of Grid Engine will use any code that looks useful, correct, tractable, and legal, including changes from Open Grid Scheduler and some of the packaging code from Debian and Fedora. (RPMs are available.) It is intended to be free software supported by a community of fellow system managers, users, and contributors. They also want to point out that there is more than just the SGE source available in their repository, including other components such as ARCo (Accounting and Reporting Console), Inspect (Monitoring and Configuration Console), and SDM (Service Domain Manager for Grid Engine services adding cloud connectivity).

Summary

The current HPC Grid Engine seems to have diverged into four camps. Oracle continues to offer Oracle Grid Engine but has no interest in the HPC market. Support forums are still active and open, but the package can be expected to diverge from the other efforts. Open Grid Scheduler, backed by Scalable Logic, is a fork of the last Sun open source release. They have a depth of experience, offer paid support, and are providing many of the leading enhancements in their codebase (which often get adopted by others). They provide source code, and contribution is welcome through their mailing lists. Univa offers, by virtue of its acquisition of Sun engineers, a deeply supported and tested product. Source code is available via a repository; however, code releases will lag the binary release. They also offer other products that integrate and enhance their core Grid Engine version. Finally, Son of Grid Engine is an open continuation and preservation of the original Sun Grid Engine project. It offers an open repository, an enhanced Univa codebase, and lots of useful documentation.

As often happens in the open source world, what was once a single domain of development and distribution has now grown into divergent paths. What was once Grid Engine from Sun is now four "different but similar" projects that seem to be seeking their own niches. Perhaps the bigger lesson in this transition is the strength of open source in a changing market. Those that tied their boat to Grid Engine can rest assured that support, fixes, and features will continue and will be available at a level to suit their needs.

Thanks to Rayson Ho of Scalable Logic, Gary Tyreman of Univa, and Dave Love of the Son of Grid Engine project for their valuable input.

Univa forks Oracle's Sun Grid Engine

The Register

Another fork has appeared in the Sun Microsystems software road. Univa is forking the Sun Grid Engine project, now controlled by Oracle.

In the wake of Oracle's $5.6bn acquisition of Sun a year ago, co-founder and chief executive officer Larry Ellison made no secret of the fact that Oracle was not going to waste time on products and projects that do not make the company money. And rightly so, by the way.

Sun Microsystems was not a research lab or a charity, but the company's top brass often behaved as if it was. Oracle has backed out of selling x64 servers and switches into HPC shops at little or no margin, and Grid Engine, a program for gathering up spare CPU capacity on desktop and laptop machines as well as on clusters of servers to run supercomputer simulations and number-crunching jobs, has gone fallow.

Gary Tyreman, who has been chief executive officer at Univa for the past three years, does not bear Ellison or the rest of Oracle any malice. He just thinks that Grid Engine is not a priority for Oracle and is dying of neglect. Univa is a company that has an OEM license from Sun, which transferred over to Oracle, for the Grid Engine product, and it makes a living selling and supporting Grid Engine and extensions to the product. Univa wants Grid Engine to be extended and improved in ways that help HPC customers. And so, the company will be working with other Grid Engine community members to put together a new distribution and offer support on that as well as prior Grid Engine versions.

To that end, Univa has hired Fritz Ferstl, Grid Engine founder and original project owner, as well as an unspecified number of principal engineers from the Sun/Oracle Grid Engine software development and support team to keep Grid Engine going. Ferstl will become Univa's chief technology officer and will run the company's EMEA operations.

"There is obvious concern in the industry when Oracle is making decisions that are appropriate for its business but not necessarily helpful for technical computing," Tyreman explained to El Reg. "We think there is a void that we can fill."

Univa was founded in 2004 by the creators of the Globus toolkit, an effort that dates from 1995 that sought to merge the grid computing techniques used in supercomputers with evolving Web services to create a compute and storage utility infrastructure. (We call this a cloud these days, but it is really a utility.) The company's founders include Steve Tuecke and Ian Foster, of Argonne National Lab, and Carl Kesselman, a researcher at the University of Southern California. Foster and Kesselman were the leaders of the Globus toolkit.

The Globus Alliance was set up in 2003 to steer the development of the Globus toolkit and its integration with other necessary components in an HPC stack. (A cluster does not live by its resource manager alone, and the Globus grid software can be managed by the PBS, Condor, and Platform LSF schedulers with unofficial support for Grid Engine through third parties.) Univa was originally established to provide commercial support for the Globus toolkit, but it expanded out to support Grid Engine integration with Globus and then started distributing Grid Engine itself.

The way the Grid Engine clustering and job scheduling software worked at Sun, there were supported binaries distributed by Sun as well as unsupported binaries based on the open source code. Under the OEM deal with Sun, Univa acquired access to the supported binaries and provided level one and two tech support on it with level three support backing from Sun.

Tyreman says that the last time Oracle pumped out a new open source version of the Sun grid software was with Grid Engine 6.2 Update 5, which was a little more than a year ago. Univa will be working with Bad Dog Consulting, which provides services for the open source Grid Engine, and the Open Grid Scheduler, a project that was formed last year when Oracle put out Grid Engine 6.2 Update 6 without source code. Open Grid Scheduler sought to maintain the Grid Engine product and provide patches and updates, just like Univa is promising to do. The difference with Univa is that it has the tech people on staff who can credibly offer an alternative to what Oracle is doing. Or not doing, as the case may be.

Incidentally, Oracle is not killing off Grid Engine and continues to have some people dedicated to the product, which Tyreman says is being predominantly positioned for financial services customers and which is being integrated into Oracle's Enterprise Manager system management tools. At the end of December, Daniel Templeton, principal product manager for the Grid Engine product at Oracle, blogged that "changes for a bright future at Oracle" were afoot for Grid Engine.

This included decommissioning the open source site and repositories and transitioning it to the Oracle Technology Network. Templeton said the Grid Engine software engineers at Oracle would be available to help with the open source and binary versions of the tool, and that Open Grid Scheduler would "be continuing on the tradition of the Grid Engine open source project" and that OGS would remain independent of Oracle Grid Engine, with "support of the Oracle team." He added that Oracle was committed to enhancing Grid Engine and was putting together a new roadmap.

It is unclear what the departure of the engineers from Oracle for jobs at Univa has done to these plans. What is clear is that you can say that Oracle caused a fork in the Grid Engine code every bit as much as it forced one with Solaris, Lustre, and other pieces of the Sun software stack.

Tyreman says that there are over 4 million CPUs in over 1,000 government, academic, and commercial establishments that have Grid Engine deploying jobs on them. He also says that this may be a low-ball figure, with as many as 2,000 to 10,000 organizations possibly using the free binaries or open source code to run jobs on their clusters. This is not a small installed base, but it is one that got used to having a say in the development of the code as well as the luxury of commercial support from Sun and Univa.

Once all of the players in the Grid Engine arena coordinate with each other, Univa will be working to roll up patches to the product based on the Grid Engine 6.2 Update 5 version, which is the last open source release. Over the long haul, Tyreman says there will be a Univa-branded version of Grid Engine based on the open source code put out, very likely before the end of the first quarter. Because of the license that Univa had with Sun, Univa has the right to call the program Grid Engine, but Univa may call it something else. In the meantime, Univa continues to sell support services for all prior versions of Grid Engine, as it has been doing for years now, as well as its add-on products for Grid Engine.

Grid Engine support costs $99 per processor core per year. An extended product called Univa UniCluster is an entire stack of software for provisioning and managing a stack, which has the Grid Engine job scheduler at the heart of that. If you want this functionality, you add $25 per core on top of the base Grid Engine support fee. (These extensions are not open source, by the way. Univa hews to the open core or freemium philosophy of software distribution - the core is open, but the extra goodies are not). Another add-on, called UniCloud, allows for the provisioning of hypervisor-based server instances (either locally or in public clouds such as Amazon EC2, Rackspace Cloud, or GoGrid) as well as the bare-metal provisioning that UniCluster can do. Add $25 per core per year on top of that (or $149 in total) if you want that feature.
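The per-core pricing quoted above is easy to tabulate for a cluster of a given size. The following is only a minimal sketch of that arithmetic: the function name `annual_cost` and its parameters are hypothetical, and it assumes (as the "$149 in total" figure suggests) that UniCloud is always priced on top of UniCluster.

```python
# Per-core annual prices quoted in the article (US dollars).
GRID_ENGINE_SUPPORT = 99   # base Grid Engine support
UNICLUSTER_ADDON = 25      # UniCluster, on top of base support
UNICLOUD_ADDON = 25        # UniCloud, on top of UniCluster


def annual_cost(cores: int, unicluster: bool = False, unicloud: bool = False) -> int:
    """Return the yearly cost in dollars for `cores` cores (hypothetical helper)."""
    if cores < 0:
        raise ValueError("cores must be non-negative")
    if unicloud:
        # Assumption: UniCloud is stacked on UniCluster, giving $149 per core.
        unicluster = True
    per_core = GRID_ENGINE_SUPPORT
    if unicluster:
        per_core += UNICLUSTER_ADDON
    if unicloud:
        per_core += UNICLOUD_ADDON
    return cores * per_core
```

For a 64-core cluster, for instance, `annual_cost(64)`, `annual_cost(64, unicluster=True)` and `annual_cost(64, unicloud=True)` give the yearly totals for the three tiers.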

Univa also has a tool that converts the scripts used in Platform Computing's Load Sharing Facility (LSF), arguably the pioneering grid scheduler in the world, so they can be run on Grid Engine. The conversion tool can emulate more than 100 LSF commands and convert them to the equivalent Grid Engine functions. This tool can port about 90 per cent of the LSF commands, says Tyreman, making it a lot easier for companies to jump from LSF to Grid Engine.

Univa has 25 employees now with the addition of the Oracle people, and it has 60 paying customers. ®

Google Summer of Code 2007 Ideas

Recommended Links

Sites

Sun GridEngine - Rocks Clusters

GridWiki

Univa Grid Engine Open Core · GitHub

Oracle Grid Engine

Oracle documentation (Oracle has taken down the Grid Engine pages after announcing a hand-over of support for Oracle Grid Engine to Univa.)

Oracle Grid Engine - Wikipedia, the free encyclopedia

(Sun BluePrints have been taken off the web by Oracle. Maybe some web archives still have copies of the content.)

Sun maintained an interesting technical library of "BluePrint Documents".

Univa Products Grid Engine Software for Workload Scheduling and Management

SunSource.net SGE Project Home

SunSource.net SGE Mailing Lists / Discussion Forums

Sun Wikis SGE Information Center Home

Gridengine.org

Gridengine.info

Gridscheduler.sourceforge.net

Grid Engine HOWTOs

Archive of defunct Sun gridengine.sunsource.net

Bioteam SGE Administration Training Slides

Bioteam SGE Quick Reference Guide

Bioteam SGE for Users Slides

Sample SGE Submission Scripts from NBCR




Copyright © 1996-2021 by Softpanorama Society. www.softpanorama.org was initially created as a service to the (now defunct) UN Sustainable Development Networking Programme (SDNP) without any remuneration. This document is an industrial compilation designed and created exclusively for educational use and is distributed under the Softpanorama Content License. Original materials copyright belong to respective owners. Quotes are made for educational purposes only in compliance with the fair use doctrine.

FAIR USE NOTICE This site contains copyrighted material the use of which has not always been specifically authorized by the copyright owner. We are making such material available to advance understanding of computer science, IT technology, economic, scientific, and social issues. We believe this constitutes a 'fair use' of any such copyrighted material as provided by section 107 of the US Copyright Law according to which such material can be distributed without profit exclusively for research and educational purposes.

This is a Spartan WHYFF (We Help You For Free) site written by people for whom English is not a native language. Grammar and spelling errors should be expected. The site contains some broken links as it develops like a living tree...

Disclaimer:

The statements, views and opinions presented on this web page are those of the author (or referenced source) and are not endorsed by, nor do they necessarily reflect, the opinions of the Softpanorama society. We do not warrant the correctness of the information provided or its fitness for any purpose. The site uses AdSense so you need to be aware of Google privacy policy. If you do not want to be tracked by Google please disable Javascript for this site. This site is perfectly usable without Javascript.

Last modified: March 12, 2019