[kepler-dev] Q: SGE command execution on remote SGE cluster started on local PC fails due to environment problems
Hoeftberger, Johann
Johann_Hoeftberger at DFCI.HARVARD.EDU
Tue Feb 24 13:43:54 PST 2015
Even after I removed the SSH connection parts of my workflow and ran
it directly on the SGE cluster head node, I got the same error.
Finally I found out how I can at least circumvent the error, so that I can
run the workflow locally on the cluster itself.
In the JobManager I had set the target property to
username@clusterservername.
As long as the target was set this way, my workflow didn't run at all.
After I changed it to "local", I could at least run the workflow directly
on the cluster. That's the best solution I have found so far.
I found a discussion about the maximum number of allowed dynamic event
clients, which seemed related to my problem.
+ No setting if maximum allowed dynamic event clients are exceed:
In my case this number is one!
I also found a discussion about interactive vs. compute vs. head nodes of
a cluster. My head node is an interactive node!
+ Can't be set on compute/interactive nodes:
That's all I could find out so far; I could not solve the original
problem itself.
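
A note on the failing command quoted below: there, "SGE_ROOT=/opt/sge,SGE_CELL=default" is handed to qsub as a command-line argument, which by itself does not place SGE_ROOT in the environment of the qsub process. That might be why qsub still asks for SGE_ROOT. The following minimal shell sketch only illustrates this general difference; it is not Kepler code, and the `probe` snippet is a hypothetical stand-in for qsub, so no SGE installation is needed to run it.

```shell
#!/bin/sh
# Hypothetical stand-in for qsub: prints whether SGE_ROOT is visible in the
# process environment, followed by the arguments it received.
probe='echo "SGE_ROOT=${SGE_ROOT:-<unset>} argv=$*"'

# Start from a clean state so the demo does not depend on the caller's shell.
unset SGE_ROOT SGE_CELL

# 1) Same shape as the failing command: the assignment is just an argument.
as_argument=$(sh -c "$probe" probe SGE_ROOT=/opt/sge,SGE_CELL=default job.sh)

# 2) Assignment placed before the command: it becomes part of the environment.
as_environment=$(SGE_ROOT=/opt/sge SGE_CELL=default sh -c "$probe" probe job.sh)

echo "as argument:    $as_argument"
echo "as environment: $as_environment"
```

In the first case the stand-in sees no SGE_ROOT and receives the assignment as plain text in its argument list; in the second case SGE_ROOT is set for the child process. Whether jobSubmitOptions can be made to produce the second form has not been verified here.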
On 02/13/2015 05:31 PM, Jianwu Wang wrote:
> Hi Johann,
> I checked my old workflow and found that I had to make some changes
> to make it work. The jobSubmitOptions setting working for me right
> now is "-ac SGE_CELL=hoffman2,SGE_ROOT=/u/systems/UGE8.0.1 -l
> h_data=1G". Since different clusters have different SGE installations, you
> need to find the values for your cluster. But it looks like "-ac" is now needed.
> Best wishes
> Sincerely yours
> Jianwu WANG, Ph.D.
> jianwu at sdsc.edu
> http://users.sdsc.edu/~jianwu/
> Assistant Director for Research
> Workflows for Data Science (WorDS) Center of Excellence
> San Diego Supercomputer Center (SDSC)
> University of California, San Diego (UCSD)
> On 2/13/15 10:49 AM, Hoeftberger, Johann wrote:
>> Hello Jianwu,
>> thank you for your interesting answer. I had not noticed those
>> settings myself. Good hint!
>> I followed your advice and got rid of the "qsub: unknown command" error
>> that way. But although I set SGE_ROOT and SGE_CELL in the
>> JobSubmitter jobSubmitOptions parameter, I still can't run my workflow
>> and always get the following error/exception:
>> ERROR (org.kepler.actor.job.JobSubmitter:fire:226)
>> org.kepler.job.JobException: Error at job submission.
>> Command:cd [...]/SGE_Testscripts/[...]_Feb13_093517EST_0;
>> /opt/sge/bin/lx24-amd64/qsub SGE_ROOT=/opt/sge,SGE_CELL=default
>> [...]/SGE_Testscripts/[...]_Feb13_093517EST_0/sgeTestscript.sh
>> Stdout:
>> Stderr:
>> Unable to initialize environment because of error: Please set the
>> environment variable SGE_ROOT.
>> Exiting.
>> org.kepler.job.JobException: Error at job submission.
>> Command:cd [...]/SGE_Testscripts/[...]_Feb13_093517EST_0;
>> /opt/sge/bin/lx24-amd64/qsub SGE_ROOT=/opt/sge,SGE_CELL=default
>> [...]/SGE_Testscripts/[...]_Feb13_093517EST_0/sgeTestscript.sh
>> Stdout:
>> Stderr:
>> Unable to initialize environment because of error: Please set the
>> environment variable SGE_ROOT.
>> Exiting.
>> at org.kepler.job.JobManager.submit(JobManager.java:307)
>> at org.kepler.job.Job.submit(Job.java:375)
>> at org.kepler.actor.job.JobSubmitter.fire(JobSubmitter.java:217)
>> at
>> ptolemy.actor.process.ProcessThread._iterateActor(ProcessThread.java:335)
>> at ptolemy.actor.process.ProcessThread.run(ProcessThread.java:212)
>> I tried different ways to set the SGE_ROOT setting: alone, together with
>> SGE_CELL, in parentheses, separated by comma, separated by semicolon.
>> All with the same outcome: Kepler "thinks" SGE_ROOT isn't set.
>> Do you have further ideas on how I could solve my issue?
>> Best regards,
>> Johann
>> On 02/06/2015 05:20 PM, Jianwu Wang wrote:
>>> Hi Johann,
>>> Can you try adding the SGE_CELL and SGE_ROOT settings to the 'job submit
>>> options' parameter of the GenericJobLauncher actor or the 'jobSubmitOptions'
>>> parameter of the JobSubmitter actor? An example is
>>> "SGE_CELL=hoffman2,SGE_ROOT=/u/systems/SGE6.1u3". You might also need to
>>> set the path for your qsub (such as
>>> "/u/systems/SGE6.2u4/bin/lx26-amd64") in the 'binary path' parameter of the
>>> GenericJobLauncher actor or the 'binPath' parameter of the JobManager actor. I
>>> remember I ran into a similar problem before and this did the trick.
>>> Best wishes
>>> Sincerely yours
>>> Jianwu WANG, Ph.D.
>>> On 2/6/15 3:31 PM, Hoeftberger, Johann wrote:
>>>> Hello,
>>>> I am trying to create my own simple Kepler test workflow on my local PC
>>>> and run it via SSH on an SGE cluster.
>>>> For that I took the Kepler demo workflow
>>>> "Job_Submission_Using_JobManager", configured it for my local situation
>>>> and tried to run it.
>>>> I know that the connection to the cluster, the login with the given
>>>> credentials and the settings for the working directory work well. (The
>>>> directories and files for the cluster jobs which should be executed
>>>> are created.)
>>>> The execution of the qsub command on the cluster doesn't work. At first
>>>> I got the exception "qsub: unknown command". To locate the error, I
>>>> hardcoded the full path of the qsub command on my SGE cluster in the
>>>> implementation (by changing the initialization of private String
>>>> _sgeSubmitCmd in JobSupportSGE.java). Then I got the next exception:
>>>> "Unable to initialize environment because of error: Please set the
>>>> environment variable SGE_ROOT".
>>>> SGE_ROOT is properly set on the SGE cluster. I tried to set it
>>>> additionally on my local PC (afterwards I restarted Eclipse, where my
>>>> Kepler instance is running) and in Eclipse under Run - Run
>>>> Configuration - Java Application - Environment. None of these attempts
>>>> solved the problem; I still get the same exception about the
>>>> uninitialized environment.
>>>> In SshExec.java, for the method public int executeCmd(String command,
>>>> OutputStream streamOut, OutputStream streamErr, String
>>>> thirdPartyTarget), I found the following documentation:
>>>> /**
>>>>  * Execute a command on the remote machine and expect a password/passphrase
>>>>  * question from the command. The stream <i>streamOut</i> should be provided
>>>>  * to get the output and errors merged. <i>streamErr</i> is not used in this
>>>>  * method (it will be empty string finally).
>>>>  *
>>>>  * @return exit code of command if execution succeeded,
>>>>  * @throws ExecTimeoutException
>>>>  *             if the command failed because of timeout
>>>>  * @throws SshException
>>>>  *             if an error occurs for the ssh connection during the command
>>>>  *             execution Note: in this method, the SSH Channel is forcing a
>>>>  *             pseudo-terminal allocation {see setPty(true)} to allow remote
>>>>  *             commands to read something from their stdin (i.e. from us
>>>>  *             here), thus, (1) remote environment is not set from
>>>>  *             .bashrc/.cshrc and (2) stdout and stderr come back merged in
>>>>  *             one stream.
>>>>  */
>>>> So I guess my problem is caused by an uninitialized or wrongly
>>>> initialized (pseudo-)terminal.
>>>> It seems to me that the environment variables set on the systems
>>>> involved (SGE cluster, local PC) are not read, and that the Kepler
>>>> implementation itself doesn't set proper values for the needed variables.
>>>> I haven't found a description or a configuration option for my
>>>> issue yet. So I think it is some kind of implementation flaw.
>>>> Can somebody give me a hint on how to solve this issue? I would like to
>>>> run Kepler locally on my PC but execute parts of my workflow remotely
>>>> on my SGE cluster.
>>>> Kind regards,
>>>> Johann Hoeftberger