[kepler-dev] Q: SGE command execution on remote SGE cluster started on local PC fails due to environment problems

Hoeftberger, Johann Johann_Hoeftberger at DFCI.HARVARD.EDU
Tue Feb 24 13:43:54 PST 2015


Hello,

Even after I removed the SSH connection parts from my workflow and ran 
it directly on the SGE cluster head node, I got the same error 
message.
Finally I found a way to at least circumvent the error, so I can run 
the workflow locally on the cluster itself.

In the JobManager I had set the target property to 
username at clusterservername.
With that setting my workflow didn't run at all. 
After I changed it to "local" I could at least run it directly 
on the cluster. That's the best solution I have found so far.
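
A minimal sketch of why the two target values may behave differently,
assuming (as the SshExec documentation quoted further down suggests) that
the remote session does not read .bashrc/.cshrc, while a "local" run
presumably inherits the environment of the Kepler process. It only
simulates the two situations on one machine; the function names are
hypothetical and /opt/sge is the value from the log quoted below.

```shell
#!/bin/sh
# Hypothetical probe: print SGE_ROOT as a child process would see it.
PROBE='echo "SGE_ROOT=${SGE_ROOT:-<unset>}"'

# Stand-in for a login on the cluster, where profile files have set the variable.
with_profile() {
    env SGE_ROOT=/opt/sge /bin/sh -c "$PROBE"
}

# Stand-in for a remote session that skips .bashrc/.cshrc: env -i starts the
# child process with an empty environment.
without_profile() {
    env -i /bin/sh -c "$PROBE"
}

with_profile      # prints SGE_ROOT=/opt/sge
without_profile   # prints SGE_ROOT=<unset>
```

A command started in the second kind of environment has no SGE_ROOT to
read, no matter how the variable is set in the login profile.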

I found a discussion about the maximum number of allowed dynamic event 
clients, which seemed related to my problem.

    + No setting if the maximum number of allowed dynamic event clients
      is exceeded:
      http://stackoverflow.com/questions/4883056/sge-qsub-fails-to-submit-jobs-in-sync-mode

In my case this number is one!

I also found a discussion about interactive vs. compute vs. head nodes 
of a cluster. My head node is an interactive node!

    + Can't be set on compute/interactive nodes:
      http://www.biac.duke.edu/forums/topic.asp?TOPIC_ID=1428


That's all I could find out so far; I could not solve the original 
problem itself.
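
For reference, in the failing command from the log quoted below,
SGE_ROOT=/opt/sge,SGE_CELL=default appears after qsub, i.e. as an ordinary
argument and not as part of the environment. A minimal sketch of that
difference, assuming a hypothetical stand-in script in place of the real
qsub (it only mimics the error message from the log); the function names
are made up for illustration.

```shell
#!/bin/sh
# Start from a state where the variables are not set, as in the failing session.
unset SGE_ROOT SGE_CELL

# Hypothetical stand-in for qsub: fails like the log when SGE_ROOT is missing
# from its environment, otherwise echoes its arguments.
FAKE_QSUB='
if [ -z "${SGE_ROOT:-}" ]; then
    echo "Please set the environment variable SGE_ROOT." >&2
    exit 1
fi
echo "submitted: $*"
'

# Form seen in the log: the settings are plain arguments, so the environment
# of the child process is unchanged and the check fails.
as_argument() {
    /bin/sh -c "$FAKE_QSUB" fake_qsub SGE_ROOT=/opt/sge,SGE_CELL=default "$@"
}

# Environment form: env(1) puts the settings into the child's environment.
as_environment() {
    env SGE_ROOT=/opt/sge SGE_CELL=default /bin/sh -c "$FAKE_QSUB" fake_qsub "$@"
}

as_argument sgeTestscript.sh || echo "argument form failed"
as_environment sgeTestscript.sh
```

On a real cluster the second form would correspond to prefixing the qsub
call with env and the two settings. Whether jobSubmitOptions can be made
to produce such a command line is not something this sketch confirms.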


Regards,
Johann

On 02/13/2015 05:31 PM, Jianwu Wang wrote:
> Hi Johann,
>
>          I checked my old workflow and found I had to make some changes
> to make it work. The jobSubmitOptions setting that works for me right
> now is "-ac SGE_CELL=hoffman2,SGE_ROOT=/u/systems/UGE8.0.1 -l
> h_data=1G". Since different clusters have different SGE installations, you
> need to find the values for your cluster. But it looks like "-ac" is now
> needed for SGE_CELL and SGE_ROOT.
>
> Best wishes
>
> Sincerely yours
>
> Jianwu WANG, Ph.D.
> jianwu at sdsc.edu
> http://users.sdsc.edu/~jianwu/
>
> Assistant Director for Research
> Workflows for Data Science (WorDS) Center of Excellence
> San Diego Supercomputer Center (SDSC)
> University of California, San Diego (UCSD)
>
> On 2/13/15 10:49 AM, Hoeftberger, Johann wrote:
>> Hello Jianwu,
>>
>> thank you for your interesting answer. I hadn't noticed those
>> settings myself. Good hint!
>>
>> I followed your advice and it got rid of the "qsub: unknown command"
>> error. But although I set SGE_ROOT and SGE_CELL in the
>> JobSubmitter jobSubmitOptions parameter, I still can't run my workflow
>> and always get the following error/exception:
>>
>> ERROR (org.kepler.actor.job.JobSubmitter:fire:226)
>> org.kepler.job.JobException: Error at job submission.
>> Command:cd [...]/SGE_Testscripts/[...]_Feb13_093517EST_0;
>> /opt/sge/bin/lx24-amd64/qsub  SGE_ROOT=/opt/sge,SGE_CELL=default
>> [...]/SGE_Testscripts/[...]_Feb13_093517EST_0/sgeTestscript.sh
>> Stdout:
>>
>> Stderr:
>>
>> Unable to initialize environment because of error: Please set the
>> environment variable SGE_ROOT.
>> Exiting.
>>
>> org.kepler.job.JobException: Error at job submission.
>> Command:cd [...]/SGE_Testscripts/[...]_Feb13_093517EST_0;
>> /opt/sge/bin/lx24-amd64/qsub  SGE_ROOT=/opt/sge,SGE_CELL=default
>> [...]/SGE_Testscripts/[...]_Feb13_093517EST_0/sgeTestscript.sh
>> Stdout:
>>
>> Stderr:
>>
>> Unable to initialize environment because of error: Please set the
>> environment variable SGE_ROOT.
>> Exiting.
>>
>>     at org.kepler.job.JobManager.submit(JobManager.java:307)
>>     at org.kepler.job.Job.submit(Job.java:375)
>>     at org.kepler.actor.job.JobSubmitter.fire(JobSubmitter.java:217)
>>     at ptolemy.actor.process.ProcessThread._iterateActor(ProcessThread.java:335)
>>     at ptolemy.actor.process.ProcessThread.run(ProcessThread.java:212)
>>
>>
>> I tried different ways to set SGE_ROOT: alone, together with
>> SGE_CELL, in parentheses, separated by a comma, separated by a semicolon.
>> All with the same outcome: Kepler "thinks" SGE_ROOT isn't set.
>>
>> Do you have further ideas how I could solve my issue?
>>
>> Best regards,
>> Johann
>>
>>
>> On 02/06/2015 05:20 PM, Jianwu Wang wrote:
>>> Hi Johann,
>>>
>>>       Can you try adding SGE_CELL and SGE_ROOT settings to the 'job submit
>>> options' parameter of the GenericJobLauncher actor or the
>>> 'jobSubmitOptions' parameter of the JobSubmitter actor? An example is
>>> "SGE_CELL=hoffman2,SGE_ROOT=/u/systems/SGE6.1u3". You might also need to
>>> set the path to your qsub (such as
>>> "/u/systems/SGE6.2u4/bin/lx26-amd64") in the 'binary path' parameter of
>>> the GenericJobLauncher actor or the 'binPath' parameter of the JobManager
>>> actor. I remember I ran into a similar problem before and this did the
>>> trick.
>>>
>>> Best wishes
>>>
>>> Sincerely yours
>>>
>>> Jianwu WANG, Ph.D.
>>> jianwu at sdsc.edu
>>> http://users.sdsc.edu/~jianwu/
>>>
>>> Assistant Director for Research
>>> Workflows for Data Science (WorDS) Center of Excellence
>>> San Diego Supercomputer Center (SDSC)
>>> University of California, San Diego (UCSD)
>>>
>>> On 2/6/15 3:31 PM, Hoeftberger, Johann wrote:
>>>> Hello,
>>>>
>>>> I am trying to create my own simple Kepler test workflow on my local PC
>>>> and run it via SSH on an SGE cluster.
>>>> For that I took the Kepler demo workflow
>>>> "Job_Submission_Using_JobManager", configured it for my local situation
>>>> and tried to run it.
>>>> I know that the connection to the cluster, the login with the given
>>>> credentials and the settings for the working directory work well (the
>>>> directories and files for the cluster jobs that should be executed
>>>> get created).
>>>>
>>>> The execution of the qsub command on the cluster doesn't work. At
>>>> first I got the exception "qsub: unknown command". To locate the
>>>> error, I hardcoded the full path of the qsub command on my SGE
>>>> cluster in the implementation (I changed the initialization of
>>>> private String _sgeSubmitCmd in JobSupportSGE.java). Then I got the
>>>> next exception: "Unable to initialize environment because of error:
>>>> Please set the environment variable SGE_ROOT".
>>>>
>>>> SGE_ROOT is properly set on the SGE cluster. I also tried setting it
>>>> on my local PC (and then restarted Eclipse, where my Kepler instance
>>>> is running) and under Eclipse - Run - Run Configuration
>>>> - Java Application - Environment. None of these attempts solved the
>>>> problem; I still get the same exception about the uninitialized
>>>> environment.
>>>>
>>>> In SshExec.java, for public int executeCmd(String command,
>>>> OutputStream streamOut, OutputStream streamErr, String thirdPartyTarget),
>>>> I found the documentation
>>>>
>>>> /**
>>>>  * Execute a command on the remote machine and expect a password/passphrase
>>>>  * question from the command. The stream <i>streamOut</i> should be provided
>>>>  * to get the output and errors merged. <i>streamErr</i> is not used in this
>>>>  * method (it will be empty string finally).
>>>>  *
>>>>  * @return exit code of command if execution succeeded,
>>>>  * @throws ExecTimeoutException
>>>>  *             if the command failed because of timeout
>>>>  * @throws SshException
>>>>  *             if an error occurs for the ssh connection during the command
>>>>  *             execution Note: in this method, the SSH Channel is forcing a
>>>>  *             pseudo-terminal allocation {see setPty(true)} to allow remote
>>>>  *             commands to read something from their stdin (i.e. from us
>>>>  *             here), thus, (1) remote environment is not set from
>>>>  *             .bashrc/.cshrc and (2) stdout and stderr come back merged in
>>>>  *             one stream.
>>>>  */
>>>>
>>>> so I guess my problem is caused by an uninitialized or wrongly
>>>> initialized (pseudo) terminal.
>>>> It seems to me that the environment variables set on the systems
>>>> involved (SGE cluster, local PC) are not read and that the Kepler
>>>> implementation itself doesn't set proper values for the needed variables.
>>>>
>>>> I haven't found a description or a configuration option for my
>>>> issue yet, so I think it is some kind of implementation flaw.
>>>>
>>>> Can somebody give me a hint on how to solve this issue? I would like to
>>>> run Kepler locally on my PC but execute parts of my workflow remotely
>>>> on my SGE cluster.
>>>>
>>>>
>>>> Kind regards,
>>>> Johann Hoeftberger
>>>>
>>>>
>>>
>>>
>>
>>
>
>
>


