
add new plugin

  • slurm.conf:
    JobSubmitPlugins=job_submit/require_timelimit
    
    file in:
    PluginDir=/usr/lib64/slurm
    
     job_submit_require_timelimit.so
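
    A quick check, assuming the .so is readable in PluginDir: restart the
    controller and confirm the plugin shows up in the running configuration.

    service slurm restart
    scontrol show config | grep JobSubmitPlugins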

Reason=Node unexpectedly rebooted

  • after reboot node stays down:
      Reason=Node unexpectedly rebooted
    
    scontrol update nodename=trestles-9-22 state=idle reason=""
    
    slurm.conf:
    
    ReturnToService=2
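
    To see which nodes are down and why before clearing them (node name taken
    from the example above):

    sinfo -R                                   # down/drained nodes with their Reason
    scontrol show node trestles-9-22 | grep -E "State|Reason"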

topology: switch configuration

  • [2014-12-17T16:25:55.480] TOPOLOGY: warning -- no switch can reach all nodes through its descendants.Do not
    use route/topology
    
    
    Three-dimension Topology
    Listing the leaf switches with their nodes
    
    SwitchName=s1 Nodes=comet-01-[01-72] LinkSpeed=58720256
    SwitchName=s2 Nodes=comet-02-[01-72] LinkSpeed=58720256
    
    no job will span leaf switches without a common parent.
    
    
    but:
    
    SwitchName=s1-01 Level=0 LinkSpeed=1 Nodes=comet-01-[54,56-72]
    SwitchName=s1-02 Level=0 LinkSpeed=1 Nodes=comet-01-[37-53,55]
    SwitchName=s1-03 Level=0 LinkSpeed=1 Nodes=comet-01-[18,20-36]
    SwitchName=s1-04 Level=0 LinkSpeed=1 Nodes=comet-01-[01-17,19]
    SwitchName=s1 Level=1 LinkSpeed=1 Nodes=comet-01-[01-72] Switches=s1-[01-04]
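
    If all nodes really sit behind one fabric, the warning can be avoided by
    giving the leaf groups a common ancestor; a hypothetical extra
    topology.conf line on top of the s1/s2 definitions above:

    SwitchName=root Switches=s[1-2]    # hypothetical top-level switch joining s1 and s2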

layout

  • [2014-12-17T16:25:55.335] layouts: no layout to initialize
    [2014-12-17T16:26:39.774] layouts: loading entities/relations information

node name configuration

  • after error: find_node_record: lookup failure for comet-01-01
    
    
    # sinfo
    PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
    compute*     up   infinite      1   idle hpc-0-5
    compute*     up   infinite   1320   down comet-01-[10-64],comet-02-[10-64],comet-03-[10-64].....
    compute*     up   infinite    482    unk comet-01-[01-09,65-72],comet-02-[01-09,65-72].....
    
    
    remove all files in /var/slurm/slurm.state
    
    PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
    compute*     up   infinite   1802    unk comet-01-[01-72],comet-02-[01-72].....
    gpu          up   infinite    146    unk comet-28-[01-72],comet-29-[01-72]
    bigmem       up   infinite     40    unk comet-30-[01-20],comet-31-[01-20]
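
    A sketch of that recovery sequence (state directory taken from the note
    above; confirm StateSaveLocation in slurm.conf before deleting anything):

    scontrol show config | grep StateSaveLocation
    service slurm stop
    rm -f /var/slurm/slurm.state/*        # discards all saved job and node state
    service slurm start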

sacct

  • default start time is "today 00:00:00"
    
    to display previous records use the -S/-E flags
    
    sacct man pages:   -S, --starttime
    Valid time formats are...
    
                     HH:MM[:SS] [AM|PM]
                     MMDD[YY] or MM/DD[/YY] or MM.DD[.YY]
                     MM/DD[/YY]-HH:MM[:SS]
                     YYYY-MM-DD[THH:MM[:SS]]
    
    
    
    sacct -S 12/24  --allocations -o "JobId,Start,End,State,User,Group,Account,JobName,Partition,\
        Submit,Eligible,ReqMem,NodeList,NNodes,TimeLimit,DerivedExitCode,\
        ExitCode,CPUTime,MaxPages,MaxVMSize,Elapsed"
    
    
    
    jobs displayed start on 12/21
    
           JobID               Start                 End      State      User     Group    Account    JobName
    ------------ ------------------- ------------------- ---------- --------- --------- ---------- ----------
    2            2014-11-21T14:42:53 2014-11-21T16:43:07    TIMEOUT      jane     testg       test       test
    3            2014-11-21T14:42:57 2014-11-21T16:43:07    TIMEOUT      jane     testg       test       test
    4            2014-11-21T14:42:57 2014-11-21T15:41:39  COMPLETED      jane     testg       test       test
    
    
    
    only if the starttime is specified in the very same format as the
    records does sacct show the correct lines:
    
    sacct -S 2014-11-24   --allocations -o
    "JobId,Start,End,State,User,Group,Account,JobName,Partition,\
        Submit,Eligible,ReqMem,NodeList,NNodes,TimeLimit,DerivedExitCode,\
        ExitCode,CPUTime,MaxPages,MaxVMSize,Elapsed"
    
    
           JobID               Start                 End      State      User     Group    Account    JobName
    ------------ ------------------- ------------------- ---------- --------- --------- ---------- ----------
    10           2014-11-24T16:36:21 2014-11-24T17:35:02  COMPLETED      jane     testg       test       test
    11           2014-11-24T16:36:21 2014-11-24T18:36:21    TIMEOUT      jane     testg       test       test
    ....
    
    sacct per job:
    
    $ sacct -S 2015-10-05 -j 1097598_267
           JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
    ------------ ---------- ---------- ---------- ---------- ---------- --------
    1097598_267  Frenkel-L+    compute     mia152         24 CANCELLED+      0:0
    1097598_267+      batch                mia152         24  CANCELLED     0:15
    
    
    sacct per job array:
    
    $ sacct -S 2015-10-05 -j 1097598
    
           JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
    ------------ ---------- ---------- ---------- ---------- ---------- --------
    1097598_999  Frenkel-L+    compute     mia152         24  COMPLETED      0:0
    1097598_999+      batch                mia152         24  COMPLETED      0:0
    1097598_0    Frenkel-L+    compute     mia152         24    TIMEOUT      1:0
    1097598_0.b+      batch                mia152         24  CANCELLED     0:15
    1097598_1    Frenkel-L+    compute     mia152         24  COMPLETED      0:0

slurm scripts on node:

  • /var/spool/slurmd/job00093/slurm_script

add 411 files to "Service Node"

  • check for group in the 411 node configuration:
    
    # rocks report host config411 hpc-0-5
    file name="/etc/411.conf" perms="0600" owner="root:root"
    master url="http://10.1.10.1:372/411.d/"/
    appliance: service-node
    group: Service_Node
    
    
    add to 411:
    
    mkdir /var/411/groups/Service_Node
    
    
    make groups
    ## Service_Node Group
    
    all: /etc/411.d/Service_Node/etc.slurm.slurmdbd..conf
    
    /etc/411.d/Service_Node/etc.slurm.slurmdbd..conf:: /var/411/groups/Service_Node/etc/slurm/slurmdbd.conf
            /opt/rocks/sbin/411put --group=Service_Node --chroot=/var/411/groups/Service_Node $?
    
    
    
    make
    
    generates the 411 file: Wrote: /etc/411.d/Service_Node/etc.slurm.slurmdbd..conf
  • account funds
    Account - 'comet':Description='comet':Organization='sdsc':Fairshare=1:GrpCPUMins=600
    sacctmgr: Parent - 'comet'
    sacctmgr: User - 'hocks':DefaultAccount='comet':Fairshare=1:QOS='normal'
    sacctmgr: User - 'nicki':DefaultAccount='comet':Fairshare=1:QOS='normal'
    sacctmgr: User - 'tanner':DefaultAccount='comet':Fairshare=1:QOS='normal'
    
    
    show:
    
    # sbank balance statement
    User           Usage |        Account     Usage | Account Limit Available (CPU hrs)
    ---------- --------- + -------------- --------- + ------------- ---------
    
    hocks             38 |          COMET        53 |            10       -43
    nicki              7 |          COMET        53 |            10       -43
    tanner             8 |          COMET        53 |            10       -43
    
    
    
    BUT:
    
    sacctmgr: Account - 'comet':Description='comet':Organization='sdsc':Fairshare=1:GrpCPUMins=600
    sacctmgr: Parent - 'comet'
    sacctmgr: User - 'hocks':DefaultAccount='comet':Fairshare=999:GrpCPUMins=3600:QOS='normal'
    sacctmgr: User - 'nicki':DefaultAccount='comet':Fairshare=1:GrpCPUMins=600
    sacctmgr: User - 'tanner':DefaultAccount='comet':Fairshare=1:GrpCPUMins=3600
    
    
    show:
    
    # sbank balance statement
    User           Usage |        Account     Usage | Account Limit Available (CPU hrs)
    ---------- --------- + -------------- --------- + ------------- ---------
    
    hocks             38 |          COMET        53 |            60         7
    nicki              7 |          COMET        53 |            60         7
    tanner             8 |          COMET        53 |            60         7
    
    
    DO NOT use GrpCPUMins per user! Set it on the account instead (see the sacctmgr sketch after this list).
  • sacct: error: Problem talking to the database: Connection refused
    /etc/slurm.conf:
    
    change
    AccountingStorageHost=127.0.0.1
    
    to:
    AccountingStorageHost=hpc-0-5
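
    Regarding the GrpCPUMins note above: a hedged sacctmgr sketch that keeps
    the limit on the account and clears any per-user copy (GrpCPUMins matches
    the Slurm version used here; newer releases use GrpTRESMins=cpu=N):

    sacctmgr modify account where name=comet set GrpCPUMins=600
    sacctmgr modify user where name=hocks set GrpCPUMins=-1    # -1 clears the per-user limit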

slurm error: authentication: credential expired

  • munge credential expired (in the munge log as well as in slurmctld.log)
    
    --: synchronize clock on node
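
    A quick way to confirm and fix the skew (example node name from this page;
    assumes an NTP server is reachable from the node):

    date; ssh comet-01-01 date                 # compare the two clocks
    ssh comet-01-01 ntpdate pool.ntp.org       # one-off resync, then fix ntpd/chrony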

job rejected: invalid feature

  • #SBATCH lines anywhere in the script are interpreted!!!!!

    #!/bin/bash -l
    ....
    #SBATCH -t 1:00:00
    #SBATCH -A comet

    code

    exit

    #SBATCH --cpus-per-task=1
    #SBATCH --constraint=gtx680
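
    A sketch of the cleaned-up script: keep the #SBATCH directives together at
    the top and delete leftover ones further down (here the gtx680 constraint
    that triggered the "invalid feature" rejection):

    #!/bin/bash -l
    #SBATCH -t 1:00:00
    #SBATCH -A comet
    #SBATCH --cpus-per-task=1

    code
    exit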

job exit code 256 (2:0)

  • # sjobexitmod -l 28
           JobID    Account   NNodes        NodeList      State ExitCode DerivedExitCode        Comment
    ------------ ---------- -------- --------------- ---------- -------- --------------- --------------
    28               (null)        1         hpc-0-6     FAILED      2:0             0:0
    6                 comet        1         hpc-0-5     FAILED      1:0             0:0
    
    
    
    sched (slurmctl): job_complete for JobId=28 successful, exit code=512
    slurm (slurmd)  : sending REQUEST_COMPLETE_BATCH_SCRIPT, error:0 status 256
    sched (slurmctl): job_complete for JobId=6 successful, exit code=256
    
    
    Job submitted from frontend:
    
    (from srun command: )
    slurmstepd: couldn't chdir to `/state/partition1/home/hocks': No such file or directory: going to /tmp instead
    
    on node:
    Could not open stdout file /state/partition1/home/hocks/slurm5.out: No such file or directory
    
    ----: cd /home/hocks and submit job from there

job exit code 256 (1:0)

  • /var/log/slurm/slurmctld.log
    [2014-10-09T12:14:44.417] sched: Allocate JobId=23 NodeList=hpc-0-[4-5] #CPUs=2
    [2014-10-09T12:14:44.446] completing job 23 status 256
    [2014-10-09T12:14:44.600] DEBUG: Dump job_resources: nhosts 2 cb 0,8
    [2014-10-09T12:14:44.601] sched: job_complete for JobId=23 successful, exit code=256
    
    
    # sjobexitmod -l 23
           JobID    Account   NNodes        NodeList      State ExitCode DerivedExitCode        Comment
    ------------ ---------- -------- --------------- ---------- -------- --------------- --------------
    23                comet        2     hpc-0-[4-5]     FAILED      1:0             0:0
    
    # sacct -X -j 23 -o JobID,NNodes,State,ExitCode,DerivedExitcode,Comment
    
           JobID   NNodes      State ExitCode DerivedExitCode        Comment
    ------------ -------- ---------- -------- --------------- --------------
    23                  2     FAILED      1:0             0:0
    
    Job submitted from hpcdev !!!!!!!

job exit codes

  • POSIX compliant:
    Exit codes 129-255 represent jobs terminated by Unix signals.
    % perl -le 'print 271 & 127'
    
    
    137    0:9   exit code 0, signal SIGKILL (-9)
    139    0:11  exit code 0, SEG FAULT (11)
    256    1:0   submit from wrong machine
    512    2:0   home filesystem not found
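
    Slurm's ExitCode column is exitcode:signal. 256 and 512 above are
    wait()-style statuses with the exit code in the high byte, while 137 and
    139 are the shell form 128+signal; either way the signal falls out of
    "& 127", as in the perl one-liner above. A small bash sketch:

    status=512; echo "exit=$(( status >> 8 ))"       # prints exit=2
    status=137; echo "signal=$(( status & 127 ))"    # prints signal=9 (SIGKILL)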

modify job exit information

  • Modify Comment: sjobexitmod:
    
    > sjobexitmod -e 49 -r "out of memory" 23
    
     You are not running a supported accounting_storage plugin
    (accounting_storage/filetxt).
    Only 'accounting_storage/slurmdbd' and 'accounting_storage/mysql' are supported.

client slurmd failure: Zero Bytes were transmitted

  • [2014-10-06T15:43:48.907] Gathering cpu frequency information for 8 cpus
    [2014-10-06T15:43:48.908] slurmd version 14.03.7 started
    [2014-10-06T15:43:48.909] slurmd started on Mon, 06 Oct 2014 15:43:48 -0700
    [2014-10-06T15:43:48.909] CPUs=8 Boards=1 Sockets=2 Cores=4 Threads=1 Memory=24151 TmpDisk=39426 Uptime=2018
    [2014-10-06T15:43:48.921] error: slurm_receive_msg: Zero Bytes were transmitted or received
    
    
    munge key permission or owner:
    
    /etc/munge/
    399   399  1024 Oct  6 15:10 munge.key
    
    
    restart munge and slurm:
    
    service munge restart
    service slurm restart
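
    If the owner or mode really is wrong, a minimal fix sketch, assuming the
    daemon runs as user "munge" (the listing above shows the key owned by a
    raw uid/gid of 399):

    chown munge:munge /etc/munge/munge.key
    chmod 400 /etc/munge/munge.key
    service munge restart
    service slurm restart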

topology

  • scontrol: error: Parsing error at unrecognized key: SwitchName
    
    plugin: /usr/lib64/slurm/topology_tree.so
    
    topology.conf is a separate file, not to be included in slurm.conf
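
    A minimal sketch of the split: slurm.conf only selects the plugin, the
    SwitchName lines live in topology.conf next to it:

    # slurm.conf
    TopologyPlugin=topology/tree

    # topology.conf (separate file in the same directory)
    SwitchName=s1 Nodes=comet-01-[01-72]
    SwitchName=s2 Nodes=comet-02-[01-72]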

slurm restart

  • [root@hpc-0-4 ~]# service slurm stop
    stopping slurmd:                                           [  OK  ]
    slurmd is stopped
    [root@hpc-0-4 ~]# ps -ef|grep slurm
    root      5922     1  0 13:18 ?        00:00:00 slurmstepd: [7]
    hocks     5933  5922  0 13:18 ?        00:00:00 /bin/bash -l /var/spool/slurmd/job00007/slurm_script
    root      6005     1  0 13:18 ?        00:00:00 slurmstepd: [9]
    hocks     6009  6005  0 13:18 ?        00:00:00 /bin/bash -l /var/spool/slurmd/job00009/slurm_script
    root      6026     1  0 13:18 ?        00:00:00 slurmstepd: [10]
    hocks     6035  6026  0 13:18 ?        00:00:00 /bin/bash -l /var/spool/slurmd/job00010/slurm_script
    root      6056     1  0 13:18 ?        00:00:00 slurmstepd: [11]
    hocks     6070  6056  0 13:18 ?        00:00:00 /bin/bash -l /var/spool/slurmd/job00011/slurm_script
    root      6429  6329  0 13:41 pts/0    00:00:00 grep slurm
    [root@hpc-0-4 ~]# service slurm start
    starting slurmd:                                           [  OK  ]
    [root@hpc-0-4 ~]# ps -ef|grep slu
    root      5922     1  0 13:18 ?        00:00:00 slurmstepd: [7]
    hocks     5933  5922  0 13:18 ?        00:00:00 /bin/bash -l /var/spool/slurmd/job00007/slurm_script
    root      6005     1  0 13:18 ?        00:00:00 slurmstepd: [9]
    hocks     6009  6005  0 13:18 ?        00:00:00 /bin/bash -l /var/spool/slurmd/job00009/slurm_script
    root      6026     1  0 13:18 ?        00:00:00 slurmstepd: [10]
    hocks     6035  6026  0 13:18 ?        00:00:00 /bin/bash -l /var/spool/slurmd/job00010/slurm_script
    root      6056     1  0 13:18 ?        00:00:00 slurmstepd: [11]
    hocks     6070  6056  0 13:18 ?        00:00:00 /bin/bash -l /var/spool/slurmd/job00011/slurm_script
    root      6445     1  0 13:41 ?        00:00:00 /usr/sbin/slurmd
    root      6453  6329  0 13:41 pts/0    00:00:00 grep slu

slurm command : Zero Bytes were transmitted or received

  • # squeue
    squeue: error: slurm_receive_msg: Zero Bytes were transmitted or received
    slurm_load_jobs error: Zero Bytes were transmitted or received
    
    munge not running, or munge.key file updated without a restart
    
    # service munge restart
    
    
    May be needed on server machine as well.

Unable to contact slurm controller

  • $ sbatch sbatch
    sbatch: error: Batch job submission failed: Unable to contact slurm controller (connect failure)
    
    check: scontrol show config | grep ControlAddr
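
    scontrol ping asks the controller directly, which separates a dead
    slurmctld from a wrong ControlAddr in slurm.conf:

    scontrol ping                     # reports UP or DOWN for the controller
    ps -ef | grep slurmctld           # run on the ControlMachine itself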

Protocol authentication error

  • [2014-10-01T12:15:03.841] error: Munge decode failed: Invalid credential
    [2014-10-01T12:15:03.842] error: authentication: Invalid credential
    [2014-10-01T12:15:03.842] error: slurm_receive_msg: Protocol authentication error
    
    
    munge-devel missing:
    
    yum install munge-devel
  • slurm bash update
    A) If you update a login node before the compute nodes, jobs will fail as
    John describes.
    
    B) If you update a compute node when there are jobs queued under the
    previous bash then they will fail when they run there (also cannot find
    modules, even though a prologue of ours sets BASH_ENV to force the env
    vars to get set).
    
    
    Our way to (hopefully safely) upgrade our x86-64 clusters was:
    
    0) Note that our slurmctld runs on the cluster management node which is
    separate to the login nodes and not accessible to users.
    
    1) Kick all the users off the login nodes, update bash, reboot them
    (ours come back with nologin enabled to stop users getting back on
    before we're ready).
    
    2) Set all partitions down to stop new jobs starting
    
    3) Move all compute nodes to an "old" partition
    
    4) Move all queued (pending) jobs to the "old" partition
    
    5) Update bash on any idle nodes and move them back to our "main"
    (default) partition
    
    6) Set an AllowGroups on the "old" partition so users can't submit jobs
    to it by accident.
    
    7) Let users back onto the login nodes.
    
    8) Set partitions back to "up" to start jobs going again.
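
    A hedged sketch of steps 2-6 and 8 with scontrol, assuming partitions
    named "main" and "old" and an "admin" group allowed on "old" (names are
    illustrative, not from the original post):

    scontrol create PartitionName=old                      # if it does not exist yet
    scontrol update PartitionName=main State=DOWN          # 2) stop new jobs starting
    scontrol update PartitionName=old Nodes=<nodelist>     # 3) park the nodes
    scontrol update JobId=<jobid> Partition=old            # 4) repeat per pending job
    scontrol update PartitionName=old AllowGroups=admin    # 6) block accidental submits
    scontrol update PartitionName=main State=UP            # 8) start jobs again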

Requested node configuration is not available

  • sbatch: error: Batch job submission failed: Requested node configuration is not available
    
    check node configuration:
    
    $ scontrol show nodes
    CPUAlloc=0 CPUErr=0 CPUTot=8 CPULoad=0.00 Features=batch
       Gres=(null)
       NodeAddr=10.1.1.251 NodeHostName=hpc-0-6 Version=14.03
       OS=Linux RealMemory=1 AllocMem=0 Sockets=8 Boards=1
    
                ^^^^^^^^^^^^^^^^^^^^^^^^^    no memory!!!
    
    
    compute log shows:
    [2014-10-01T12:25:54.157] Node configuration differs from hardware: CPUs=8:8(hw) Boards=1:1(hw) SocketsPerBoard=8:2(hw) CoresPerSocket=1:4(hw) ThreadsPerCore=1:1(hw)
    
    
    set node configuration in slurm.conf (nodenames.conf):
    CPUs=8 SocketsPerBoard=2 CoresPerSocket=4 ThreadsPerCore=1 RealMemory=24151 TmpDisk=39426
  • Requested node configuration is not available
    #SBATCH --nodes=1-1
    #SBATCH --ntasks=2
    #SBATCH --cpus-per-task=1
    
    
    [2014-09-23T14:58:03.833] Job 152 priority: 0.00 + 0.00 + 208.33 + 10.00 + 0.00 - 0 = 218.33
    [2014-09-23T14:58:03.834] cons_res: select_p_job_test: job 152 node_req 1 mode 1
    [2014-09-23T14:58:03.834] cons_res: select_p_job_test: min_n 1 max_n 1 req_n 1 avail_n 2
    [2014-09-23T14:58:03.834] node:hpc-0-4 cpus:8 c:4 s:2 t:1 mem:24151 a_mem:0 state:1
    [2014-09-23T14:58:03.834] gres/gpu: state for hpc-0-4
    [2014-09-23T14:58:03.834]   gres_cnt found:0 configured:0 avail:0 alloc:0
    [2014-09-23T14:58:03.834]   gres_bit_alloc:NULL
    [2014-09-23T14:58:03.834] node:hpc-0-5 cpus:8 c:4 s:2 t:1 mem:24151 a_mem:0 state:0
    [2014-09-23T14:58:03.834] gres/gpu: state for hpc-0-5
    [2014-09-23T14:58:03.834]   gres_cnt found:TBD configured:0 avail:0 alloc:0
    [2014-09-23T14:58:03.834]   gres_bit_alloc:NULL
    [2014-09-23T14:58:03.834] node:hpc-0-6 cpus:8 c:4 s:2 t:1 mem:24151 a_mem:0 state:1
    [2014-09-23T14:58:03.834] gres/gpu: state for hpc-0-6
    [2014-09-23T14:58:03.834]   gres_cnt found:0 configured:0 avail:0 alloc:0
    [2014-09-23T14:58:03.834]   gres_bit_alloc:NULL
    [2014-09-23T14:58:03.834] part:CLUSTER rows:1 pri:1
    [2014-09-23T14:58:03.834] part:compute rows:4 pri:1
    [2014-09-23T14:58:03.834]   row0: num_jobs 2: bitmap: 0-7,16-23
    [2014-09-23T14:58:03.834]   row1: num_jobs 0: bitmap: [no row_bitmap]
    [2014-09-23T14:58:03.834]   row2: num_jobs 0: bitmap: [no row_bitmap]
    [2014-09-23T14:58:03.834]   row3: num_jobs 0: bitmap: [no row_bitmap]
    [2014-09-23T14:58:03.834] part:gpu rows:4 pri:1000
    [2014-09-23T14:58:03.834] part:large rows:4 pri:1
    [2014-09-23T14:58:03.834] cons_res: cr_job_test: evaluating job 152 on 2 nodes
    [2014-09-23T14:58:03.834] cons_res: _can_job_run_on_node: 8 cpus on hpc-0-4(1), mem 0/24151
    [2014-09-23T14:58:03.834] cons_res: _can_job_run_on_node: 8 cpus on hpc-0-6(1), mem 0/24151
    [2014-09-23T14:58:03.834] cons_res: eval_nodes:0 consec c=8 n=1 b=0 e=0 r=-1
    [2014-09-23T14:58:03.834] cons_res: eval_nodes:1 consec c=8 n=1 b=2 e=2 r=-1
    [2014-09-23T14:58:03.834] cons_res: cr_job_test: test 0 fail: insufficient resources
    
    
    works with
    #SBATCH --ntasks-per-node=2
  • job pending
    with
    
    JobState=PENDING Reason=AssociationJobLimit
    
    there is no command to force the job to run; you can, however, change the job priority with
    scontrol update job=JOBID priority=....
  • job_submit (deployment sketch after this list)
    --
    -- Check for unlimited memory requests
    --
       if job_desc.pn_min_memory == 0 then
          log_info("slurm_job_submit: job from uid %d invalid memory request MaxMemPerNode", job_desc.user_id)
          return 2044 -- signal ESLURM_INVALID_TASK_MEMORY
       end
  • scontrol update NodeName=hpc-0-5 State=RESUME

    node : State=IDLE*   no slurm daemon running
  • sbank sbatch flags not supported
    $ sbank submit --array=1-4 -J Array ./sleepme 86400
    flags:WARN getopt: unrecognized option '--array=1-4'
    getopt: invalid option -- 'J'
     -- 'Array' './sleepme' '86400'
    flags:FATAL unable to parse provided options with getopt.
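
    For the job_submit item above: with the Lua plugin the check goes inside
    slurm_job_submit() in job_submit.lua, which the plugin typically reads
    from the same directory as slurm.conf; a hedged sketch of the wiring
    (paths assumed, not taken from the original note):

    # slurm.conf
    JobSubmitPlugins=job_submit/lua

    # put the check inside slurm_job_submit() in /etc/slurm/job_submit.lua,
    # then restart the controller so the plugin list is re-read
    service slurm restart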

sbank array jobs counted 1

  • $ sbank submit -s sbatch
    log: Getting balance for hocks
    User           Usage |        Account     Usage | Account Limit Available (CPU hrs)
    ---------- --------- + -------------- --------- + ------------- ---------
    
    hocks *            4 |           TEST         4 |         3,600     3,596
    log: Checking script before submitting
    warn: no account specified in the script, using default: test
    Current balance      =      3,596
    Requested hours      =          1
    Expected balance     =      3,595
    log: sbatch'ing the script
    Submitted batch job 65

sbank no reservations

  • [hocks@hpcdev-005 ~]$ sbank submit -s sbatch
    User           Usage |        Account     Usage | Account Limit Available (CPU hrs)
    ---------- --------- + -------------- --------- + ------------- ---------
    
    hocks *            6 |           TEST         6 |         3,600     3,594
    Current balance      =      3,594
    Requested hours      =          1
    Expected balance     =      3,593
    Submitted batch job 70
    
    [hocks@hpcdev-005 ~]$ sbank submit -s sbatch
    log: Getting balance for hocks
    User           Usage |        Account     Usage | Account Limit Available (CPU hrs)
    ---------- --------- + -------------- --------- + ------------- ---------
    
    hocks *            6 |           TEST         6 |         3,600     3,594
    Current balance      =      3,594
    Requested hours      =          1
    Expected balance     =      3,593
    Submitted batch job 71
    
    [hocks@hpcdev-005 ~]$ sbank submit -s sbatch
    User           Usage |        Account     Usage | Account Limit Available (CPU hrs)
    ---------- --------- + -------------- --------- + ------------- ---------
    
    hocks *            6 |           TEST         6 |         3,600     3,594
    Current balance      =      3,594
    Requested hours      =          1
    Expected balance     =      3,593
    Submitted batch job 82
  • scontrol show nodes
    8 jobs running but not listed
    
    NodeName=hpc-0-4 Arch=x86_64 CoresPerSocket=4
       CPUAlloc=8 CPUErr=0 CPUTot=8 CPULoad=0.93 Features=rack-0,8CPUs
       Gres=(null)
       NodeAddr=10.1.1.253 NodeHostName=hpc-0-4 Version=14.03
       OS=Linux RealMemory=24151 AllocMem=0 Sockets=2 Boards=1
       State=ALLOCATED ThreadsPerCore=1 TmpDisk=39426 Weight=20488104
       BootTime=2014-05-07T15:54:24 SlurmdStartTime=2014-05-07T16:49:43
       CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
       ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
    
    hpc-0-4
        state = busy
        np = 8
        properties = rack-0,8CPUs
        ntype = cluster
    status = rectime=1399584257,state=busy,slurmstate=allocated,size=40372224kb:40372224kb,ncpus=8,boards=1,sockets=2,cores=4,threads=1,availmem=24151mb,opsys=linux,arch=x86_64
    
    
    
    $ squeue
                 JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
                  65_1    hpcdev     test    hocks  R       0:05      1 hpc-0-4
                  65_2    hpcdev     test    hocks  R       0:05      1 hpc-0-4
                  65_3    hpcdev     test    hocks  R       0:05      1 hpc-0-4
                  65_4    hpcdev     test    hocks  R       0:05      1 hpc-0-4
                  61_1    hpcdev    Array    hocks  R       3:27      1 hpc-0-4
                  61_2    hpcdev    Array    hocks  R       3:27      1 hpc-0-4
                  61_3    hpcdev    Array    hocks  R       3:27      1 hpc-0-4
                  61_4    hpcdev    Array    hocks  R       3:27      1 hpc-0-4
    
    
    
    slurm jobID:
    
    JobId=61 ArrayJobId=61 ArrayTaskId=1 Name=Array
    JobId=62 ArrayJobId=61 ArrayTaskId=2 Name=Array
    JobId=63 ArrayJobId=61 ArrayTaskId=3 Name=Array
    JobId=64 ArrayJobId=61 ArrayTaskId=4 Name=Array

Reason=batch job complete failure

  • [2014-05-07T15:32:30.110] [38] pam_setcred: Failure setting user credentials
    [2014-05-07T15:32:30.110] [38] error in pam_setup
    [2014-05-07T15:32:30.110] [38] pam_close_session: Cannot make/remove an entry for the specified session
    [2014-05-07T15:32:30.116] [38] job_manager exiting abnormally, rc = 4020
    [2014-05-07T15:32:30.116] [38] sending REQUEST_COMPLETE_BATCH_SCRIPT, error:4020 status -1
    [2014-05-07T15:32:30.241] [38] done with job

sbatch: error: Batch job submission failed: More processors requested than permitted

  • scontrol show partition
    
    PartitionName=batch
       ....
       Nodes=(null) TotalCPUs=0 TotalNodes=0
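
    Nodes=(null) means the partition has no nodes assigned, so every CPU
    request exceeds what it can offer. A hypothetical slurm.conf line
    assigning nodes to it:

    PartitionName=batch Nodes=hpc-0-[4-6] Default=YES MaxTime=INFINITE State=UP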

slurm configuration

  • [2014-05-06T12:08:23.649] error: Node hpc-0-4 appears to have a different slurm.conf than the slurmctld.  This could cause issues with communication and functionality.  Please review both files and make sure they are the same.  If this is expected ignore, and set DebugFlags=NO_CONF_HASH in your slurm.conf.
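
    A quick way to bring the node back in sync by hand (assuming slurm.conf
    lives in /etc/slurm and is not distributed via 411):

    scp /etc/slurm/slurm.conf hpc-0-4:/etc/slurm/slurm.conf
    ssh hpc-0-4 service slurm restart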

partition change

  • changes in the partition table need a slurm restart! All running jobs will be killed

slurm roll rocks compile

  • Makefile to avoid rocks dummy .spec file:
    
    
    # Don't re-import Rules-linux-centos.mk
    __RULES_LINUX_CENTOS_MK = yes
    
    REDHAT.ROOT = $(CURDIR)/../../
    
    -include $(ROCKSROOT)/etc/Rules.mk
    include Rules.mk
    
    ifeq ($(REDHAT.ROOT),)
    REDHAT.ROOT     = /usr/src/redhat
    endif
    ifeq ($(REDHAT.VAR),)
    REDHAT.VAR      = /var
    endif
    
    REDHAT.SOURCES  = $(REDHAT.ROOT)/SOURCES
    REDHAT.SPECS    = $(REDHAT.ROOT)/SPECS
    REDHAT.BUILD    = $(REDHAT.ROOT)/BUILD
    REDHAT.RPMS     = $(REDHAT.ROOT)/RPMS
    REDHAT.SRPMS    = $(REDHAT.ROOT)/SRPMS
    
    ifneq ($(RPM.BUILDROOT),)
    BUILDROOT = $(RPM.BUILDROOT)
    else
    BUILDROOT = $(shell pwd)/$(NAME).buildroot
    endif
    
    HOME    = $(CURDIR)
    
    .PHONY: $(HOME)/.rpmmacros
    $(HOME)/.rpmmacros:
            rm -f $@
            @echo "%_topdir $(REDHAT.ROOT)" > $@
            @echo "%_buildrootdir $(BUILDROOT)" >> $@
            @echo "%buildroot $(BUILDROOT)" >> $@
            @echo "%_var    $(REDHAT.VAR)" >> $@
            @echo "%debug_package   %{nil}" >> $@
    
    rpm: $(HOME)/.rpmmacros
            rpmbuild -ta $(NAME)-$(VERSION).$(TARBALL_POSTFIX)
    
    clean::
            rm -f $(HOME)/.rpmmacros
    
    
    
    version.mk
    
    NAME            = slurm
    VERSION         = 14.03.7
    TARBALL_POSTFIX = tar.bz2

problemas-comunes-slurm (last edited 2019-06-26 17:25:40 by FabioDuran)