
CS 4.11.2+ SSVM bug when scanning templates #3838

@bwsw

Description

ISSUE TYPE
  • Bug Report
COMPONENT NAME
SSVM
CLOUDSTACK VERSION
4.11.2 and later; master is also affected
CONFIGURATION
OS / ENVIRONMENT

N/A

SUMMARY

Several months ago I found that the SSVM occasionally loses all the templates (we have a lot of them), causing massive redownloading that takes a lot of time. Today I started investigating and found the following logs on the SSVM (DEBUG enabled):

2020-01-26 02:58:39,710 DEBUG [cloud.agent.Agent] (agentRequest-Handler-1:null) Processing command: com.cloud.agent.api.storage.ListTemplateCommand
2020-01-26 02:58:39,710 INFO  [storage.resource.NfsSecondaryStorageResource] (agentRequest-Handler-1:null) Determined host 10.252.2.4 corresponds to IP 10.252.2.4
2020-01-26 02:58:39,711 DEBUG [storage.resource.NfsSecondaryStorageResource] (agentRequest-Handler-1:null) Mounting device with nfs-style path of 10.252.2.4:/secondary
2020-01-26 02:58:39,711 DEBUG [storage.resource.NfsSecondaryStorageResource] (agentRequest-Handler-1:null) mount nfs://10.252.2.4/secondary on /mnt/SecStorage/c082d67e-cc0c-3655-adbf-cfb545c7beec
2020-01-26 02:58:39,711 DEBUG [storage.resource.NfsSecondaryStorageResource] (agentRequest-Handler-1:null) making available /mnt/SecStorage/c082d67e-cc0c-3655-adbf-cfb545c7beec on nfs://10.252.2.4/secondary
2020-01-26 02:58:39,711 DEBUG [storage.resource.NfsSecondaryStorageResource] (agentRequest-Handler-1:null) local folder for mount will be /mnt/SecStorage/c082d67e-cc0c-3655-adbf-cfb545c7beec
2020-01-26 02:58:39,712 DEBUG [storage.resource.NfsSecondaryStorageResource] (agentRequest-Handler-1:null) Executing: mount 
2020-01-26 02:58:39,713 DEBUG [storage.resource.NfsSecondaryStorageResource] (agentRequest-Handler-1:null) Executing while with timeout : 1440000
2020-01-26 02:58:39,714 DEBUG [storage.resource.NfsSecondaryStorageResource] (agentRequest-Handler-1:null) Execution is successful.
2020-01-26 02:58:39,715 DEBUG [storage.resource.NfsSecondaryStorageResource] (agentRequest-Handler-1:null) Some device already mounted at /mnt/SecStorage/c082d67e-cc0c-3655-adbf-cfb545c7beec, no need to mount nfs://10.252.2.4/secondary
2020-01-26 02:58:39,718 DEBUG [storage.template.DownloadManagerImpl] (agentRequest-Handler-1:null) Executing: /usr/local/cloud/systemvm/scripts/storage/secondary/listvmtmplt.sh -r /mnt/SecStorage/c082d67e-cc0c-3655-adbf-cfb545c7beec/template/tmpl/
2020-01-26 02:58:39,720 DEBUG [storage.template.DownloadManagerImpl] (agentRequest-Handler-1:null) Executing while with timeout : 3600000
2020-01-26 02:58:40,170 DEBUG [storage.template.DownloadManagerImpl] (agentRequest-Handler-1:null) Exit value is 143
2020-01-26 02:58:40,175 WARN  [storage.template.DownloadManagerImpl] (agentRequest-Handler-1:null) Exception: /usr/local/cloud/systemvm/scripts/storage/secondary/listvmtmplt.sh -r /mnt/SecStorage/c082d67e-cc0c-3655-adbf-cfb545c7beec/template/tmpl/
java.io.IOException: Stream closed
        at java.io.BufferedInputStream.getBufIfOpen(BufferedInputStream.java:170)
        at java.io.BufferedInputStream.read(BufferedInputStream.java:336)
        at sun.nio.cs.StreamDecoder.readBytes(StreamDecoder.java:284)
        at sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:326)
        at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:178)
        at java.io.InputStreamReader.read(InputStreamReader.java:184)
        at java.io.BufferedReader.fill(BufferedReader.java:161)
        at java.io.BufferedReader.readLine(BufferedReader.java:324)
        at java.io.BufferedReader.readLine(BufferedReader.java:389)
        at com.cloud.utils.script.OutputInterpreter.processError(OutputInterpreter.java:37)
        at com.cloud.utils.script.Script.execute(Script.java:289)
        at org.apache.cloudstack.storage.template.DownloadManagerImpl.listTemplates(DownloadManagerImpl.java:809)
        at org.apache.cloudstack.storage.template.DownloadManagerImpl.gatherTemplateInfo(DownloadManagerImpl.java:824)
        at org.apache.cloudstack.storage.resource.NfsSecondaryStorageResource.execute(NfsSecondaryStorageResource.java:2052)
        at org.apache.cloudstack.storage.resource.NfsSecondaryStorageResource.executeRequest(NfsSecondaryStorageResource.java:306)
        at com.cloud.agent.Agent.processRequest(Agent.java:645)
        at com.cloud.agent.Agent$AgentRequestHandler.doTask(Agent.java:1083)
        at com.cloud.utils.nio.Task.call(Task.java:83)
        at com.cloud.utils.nio.Task.call(Task.java:29)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
2020-01-26 02:58:40,182 INFO  [cloud.agent.Agent] (AgentShutdownThread:null) Stopping the agent: Reason = sig.kill
2020-01-26 02:58:40,187 INFO  [storage.template.DownloadManagerImpl] (agentRequest-Handler-1:null) found 0 templates[]
2020-01-26 02:58:40,188 DEBUG [cloud.agent.Agent] (AgentShutdownThread:null) Sending shutdown to management server
2020-01-26 02:58:40,188 DEBUG [cloud.agent.Agent] (agentRequest-Handler-1:null) Seq 190-6215811910701416453:  { Ans: , MgmtId: 179116088862604, via: 190, Ver: v1, Flags: 110, [{"com.cloud.agent.api.storage.ListTemplateAnswer":{"secUrl":"nfs://10.252.2.4/secondary","templateInfos":{},"result":true,"details":"success","wait":0}}] }
2020-01-26 02:58:40,240 DEBUG [utils.nio.NioConnection] (Agent-NioConnectionHandler-1:null) Location 1: Socket Socket[addr=/10.252.2.1,port=8250,localport=39288] closed on read.  Probably -1 returned: Connection closed with -1 on reading size.

I suppose the reason the SSVM starts redownloading templates is that /usr/local/cloud/systemvm/scripts/storage/secondary/listvmtmplt.sh gets killed by someone. I tried to execute /usr/local/cloud/systemvm/scripts/storage/secondary/listvmtmplt.sh -r /mnt/SecStorage/c082d67e-cc0c-3655-adbf-cfb545c7beec/template/tmpl/ manually and it works fine. The timeout of 3600000 ms is also not the problem: the templates are listed in about 30 seconds.
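
For reference, a run of this kind can be timed directly on the SSVM (the path is the one from the log above):

# Run the lister by hand and time it; here it finishes in roughly 30 seconds,
# well inside the 3600000 ms timeout.
time /usr/local/cloud/systemvm/scripts/storage/secondary/listvmtmplt.sh \
    -r /mnt/SecStorage/c082d67e-cc0c-3655-adbf-cfb545c7beec/template/tmpl/ > /dev/null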

I also tried to find an OOM kill; there is no such information in dmesg.


STEPS TO REPRODUCE

Probably a large template hierarchy with significant listing time. Delaying the script with a sleep BEFORE THE FIRST LINE IS RETURNED should also produce the undesired behavior; see the sketch below.
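
A hypothetical way to simulate such a delay on the SSVM (paths are the standard SSVM ones seen in the log above) is to wrap the real lister so that it stays silent before printing its first line:

cd /usr/local/cloud/systemvm/scripts/storage/secondary
mv listvmtmplt.sh listvmtmplt.real.sh
cat > listvmtmplt.sh <<'EOF'
#!/bin/bash
# Simulate a slow listing: no stdout at all for 60 seconds, then the real output.
sleep 60
exec /usr/local/cloud/systemvm/scripts/storage/secondary/listvmtmplt.real.sh "$@"
EOF
chmod +x listvmtmplt.sh

Then trigger a template scan (ListTemplateCommand) and watch the agent log for "Exit value is 143" followed by "found 0 templates[]".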

EXPECTED RESULTS

SSVM doesn't download 'good' templates more than once.

ACTUAL RESULTS

SSVM occasionally redownloads the whole set of templates.

I implemented a workaround in:

/usr/local/cloud/systemvm/scripts/storage/secondary/listvmtmplt.sh

# If a pre-built listing cache exists, serve it instead of walking the template tree.
if [ -f "$rootdir/template.cache" ]
then
	cat "$rootdir/template.cache"
	exit 0
fi

and build the cache from outside the SSVM (on the storage host, with the same script run from cron):

root@cs2-secstorage-nfs1:/etc/cron.5min# cat ssvm-workaround-fixit 
#!/bin/bash

# Rebuild the template listing cache served by the patched listvmtmplt.sh.
REALDIR=/secondary/template/tmpl/
# Path prefix as the SSVM sees this storage (kept in location.systemprefix).
DIR=$(cat /secondary/template/tmpl/location.systemprefix)
# Build into a temporary file first, then publish it as the cache.
/opt/bin/listvmtmplt.sh -r "$REALDIR" | sed "s#$REALDIR#$DIR#g" > /secondary/template/tmpl/template.cache.tmp
cp /secondary/template/tmpl/template.cache.tmp /secondary/template/tmpl/template.cache
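
For completeness, a hypothetical cron entry for the storage host, assuming the script above is saved as /etc/cron.5min/ssvm-workaround-fixit and that directory is not already picked up by cron automatically:

# /etc/crontab: rebuild the template cache every five minutes.
*/5 * * * * root /etc/cron.5min/ssvm-workaround-fixit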

So it looks like listvmtmplt.sh gets killed when it runs for a long time while RETURNING NO STDOUT LINES...
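
A quick way to see how long the script stays silent before its first line of output (it is this silent window, rather than the total runtime, that seems to matter):

# Measure the time until listvmtmplt.sh prints its first line (path from the log above).
start=$(date +%s)
/usr/local/cloud/systemvm/scripts/storage/secondary/listvmtmplt.sh \
    -r /mnt/SecStorage/c082d67e-cc0c-3655-adbf-cfb545c7beec/template/tmpl/ |
    { IFS= read -r first_line; echo "first line after $(( $(date +%s) - start )) seconds: $first_line"; }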

Tracing

I found how the bug can be traced:

1st: let's look at execute (Script.execute in the stack trace above), which launches the template lister.

It first launches the process and then submits an executor task that gathers the output with a timeout.

2nd: let's look at TimedOutLogger, which is used to collect the output.

The implementation is probably wrong, as it does not account for the reader initially being not ready. It assumes the reader is ready from the start, but the reader only becomes ready once the process has returned something; if it hasn't, the reader is not ready... and the process gets terminated.

Next, look at the Task implementation. It includes a notifyAll, which interrupts the thread and causes TimedOutLogger to be run in the task.

And... if the script has not returned any data by the time the task running TimedOutLogger is launched, the process is terminated by TimedOutLogger.
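
As a side note, the "Exit value is 143" in the log above is consistent with this: 143 = 128 + 15, i.e. the child was killed with SIGTERM, which is what destroying the process produces on Linux. A minimal shell illustration (not CloudStack code):

# A child terminated by SIGTERM reports exit value 143 = 128 + 15,
# matching "Exit value is 143" in the SSVM log.
sleep 60 &
pid=$!
kill -TERM "$pid"
wait "$pid"
echo "exit value: $?"   # prints 143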

To me, the implementation is very unclear and error-prone and should be reimplemented. This is a critical bug for the SSVM: it strikes whenever the script is slow, which is a normal case for large secondary storages or simply under load.
