When a system does not play back in realtime, in cases that look similar to others that worked fine before, there can be many different reasons.
In our experience, a common mistake is to assume that "it must be for the same reason as the previous time", because in this area the same symptom can be related to many different causes. We strongly recommend approaching this type of problem with an open mind and not prejudging the cause, as prejudging has proven to create long delays in finding solutions that are typically easy to apply once the problem is understood.
In this document we will try to provide a strategy to reach the correct diagnosis.
In general, these are the most common causes of realtime problems found in support cases (in no particular order):
- Disk speed issues (uncompressed formats that demand too much bandwidth, RAID rebuild in progress, saturation due to unnoticed simultaneous processes, fragmentation, ...)
- CPU issues (compressed codecs that are too complex, non-optimal performance settings, ...)
- GPU issues (graphics memory fragmented or exhausted, high temperature, electrical issues, ...)
- Memory issues (memory "swapping")
- Video board problems (non-coherent input signals confusing the board clock, failed units, wrong firmware)
- Operational issues (unnoticed activation of certain settings (Stereo3D, reduced Playback Speed settings, etc.), unnoticed complexity of the timeline, media files of different types than expected or not in the expected realtime locations, ingest of enumerated sequences made in non-alphanumerical order, inefficient combinations of video settings, ...)
- PCIe slots (non-optimal position of PCIe boards after hardware changes)
- Electrical issues
- Temperature issues
- Hardware failures or cabling issues
And there are more...
Before going into the matter, note that Mistika provides a couple of indicators that can help with the diagnosis (they only apply to SDI playback):
* Lost Sync message: If it appears, it indicates that realtime has been lost at some point (even for just one frame). The message remains there until playback is stopped.
* Ring buffer value: This is a number appearing at the top right of the red bar of the record monitor. It tells us how many images are pre-rendered by the processing-ahead engine. When it decreases, that part of the timeline is not being processed in realtime, although realtime will not be lost until it goes too low (a dangerous level is when it drops below 2 x the pipe units value). Small changes up and down are perfectly normal, but a constant decrease indicates that playback is passing through an area that is too complex or that there is a problem there. That is also the reason why a certain area can sometimes sustain some playbacks but not others, as it also depends on where the playback was started. The mConfig->Performance->Cache setting can also produce this effect in the case of uncompressed images.
Also please note that playback to Live Video is in general clearly faster than playback to graphics. This is because the GPU can concentrate on rendering and does not need to draw the images, but especially because of the special SDI ring buffer (which is not available for GUI playback).
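The ring buffer rule of thumb above can be written down as a small check. This is only a sketch: Mistika shows the value on the monitor, so the readings here are assumed to be noted by hand during a playback, and the function name and the way a "constant decrease" is detected are our own assumptions.

```python
def ring_buffer_status(samples, pipe_units):
    """Classify a series of ring buffer readings taken during one playback.

    samples    -- ring buffer values read off the monitor, oldest first
    pipe_units -- the PipeUnits value configured in mConfig
    """
    if not samples:
        raise ValueError("at least one sample is required")
    danger_level = 2 * pipe_units  # threshold quoted in the text
    if samples[-1] < danger_level:
        return "danger"
    # Assumption: "constant decrease" means no reading ever rises and the
    # last reading is lower than the first one.
    never_rises = all(b <= a for a, b in zip(samples, samples[1:]))
    if len(samples) > 1 and never_rises and samples[-1] < samples[0]:
        return "draining"
    return "ok"
```

With PipeUnits at 24, readings that wobble around 120 are reported as "ok", a steady fall from 120 to 90 as "draining", and anything ending below 48 as "danger".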
A methodical strategy to isolate the problem
First of all, make sure that all the media files are local (either in a SAN or in local storage). Using network drives can make the diagnosis very complex and uncertain due to the RAM caches and the other computers that are involved. If you are using network drives, first try with local resources. If it works in realtime with local resources then the diagnostic actions can be much more focused.
When there is no obvious reason, there are a few tests that will give us a lot of information:
- Check for error messages.
Launch Mistika in a console and keep it visible during playback to check for abnormal errors.
Check the content of the /var/log/messages file for hardware errors. You can also track the hardware feedback during playback by opening a console and executing this:
tail -f /var/log/messages
Any hardware issue that happens from then on while rendering may provide useful feedback in that console. Pay special attention to CPU temperature issues.
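For long logs it can help to filter the file instead of reading it by eye. The following is a minimal sketch in Python; the keyword list is an illustrative assumption, not an official list of kernel messages, and should be adapted to what your system actually logs.

```python
import re

# Illustrative keywords only (assumption): extend or replace as needed.
SUSPECT = re.compile(
    r"temperature|thermal|throttl|hardware error|i/o error", re.IGNORECASE
)

def suspicious_lines(lines):
    """Return the log lines that mention one of the suspect keywords."""
    return [line.rstrip("\n") for line in lines if SUSPECT.search(line)]

def scan_log(path="/var/log/messages"):
    # errors="replace" keeps the scan going if the log contains odd bytes.
    with open(path, errors="replace") as handle:
        return suspicious_lines(handle)
```

Reading /var/log/messages normally requires root permissions, so scan_log() would have to be run accordingly.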
- If it is not a recurrent problem, reboot the system and try again. Maybe it was a one-time problem. If it is not, check whether it only happens after a while rather than from the beginning, as this is valuable information for some of the points below.
- Switch off Live video and try again (playback to the GUI monitor). If it can play back in realtime then the problem is probably one of the Video Board issues in the list above.
- Open a system performance monitor (for example, type gnome-system-monitor in a console). During playback, check that the memory curve is not going into swap levels and that the CPU curves are not too high.
* A single CPU curve (or a few of them) constantly at 100% with the others at rest indicates that those particular files cannot be properly parallelised for some reason (if they are expected to work well in parallel, render another version in the same format to see whether the problem is related to those particular files).
* If many CPU curves are high but not all of them, check mConfig->PerformanceSettings->PipeUnits. In general, high values are recommended (24 is a good reference, but some storage models may fail to work efficiently with high numbers).
* If the problem is related to CPU saturation (all CPU cores high) then there is not much that you can try. Check that Hyper-threading is active in the BIOS. If you are already using all the cores, the system simply cannot do what you want.
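The three CPU patterns above can be summarised as a small classifier. This is a sketch under stated assumptions: the per-core percentages are typed in from the performance monitor, and the "busy" and "idle" thresholds (as well as the "a quarter of the cores" limit) are illustrative values of our own.

```python
def classify_cpu(loads, busy=90.0, idle=30.0):
    """Map per-core utilisation (percent) to the cases described in the text.

    The busy/idle thresholds are illustrative assumptions.
    """
    if not loads:
        raise ValueError("no cores given")
    busy_cores = sum(1 for load in loads if load >= busy)
    idle_cores = sum(1 for load in loads if load <= idle)
    if busy_cores == len(loads):
        return "saturated: all cores busy"
    if busy_cores == 0:
        return "cpu does not look like the bottleneck"
    # A few cores pinned while every other core rests.
    if busy_cores + idle_cores == len(loads) and busy_cores <= len(loads) // 4:
        return "poor parallelisation: a few cores busy, the rest idle"
    return "partial load: check PipeUnits"
```

For example, one core at 100% with seven cores near zero falls into the "poor parallelisation" case, while five busy cores out of eight point to the PipeUnits setting.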
- If the problem was not CPU related, the next thing is to check whether it is related to media files or disk speed: substitute the media clips with fast effects (Wipe fx are ideal). If it can play back in realtime, then the problem is most probably a disk speed issue or a speed issue related to the media files. If that is the case:
* Now do the opposite. Remove all the effects and play back only the media files. If it does not work, that confirms that the problem is related to disk speed or to those particular media files.
* If it works in realtime for a while before realtime is lost, check mConfig->PerformanceSettings->RingBuffer. Values smaller than 125 are not recommended, while very high values may trigger swapping, which is very slow. Also, if the storage is internal, dust accumulation in the fan filters or improper ventilation can create temperature issues leading to low performance or random interruptions, typically soon after booting up the system even if starting from a cold state.
* Render the media files to .js format, at the same resolution and color space as the originals. This is not a workaround; this test will tell you whether the problem is a disk speed issue or an image format issue. If Mistika .js works in realtime then it is a media format issue; if not, it is a disk speed issue. (By the way, a potential workaround for both could be to render a "Playback Cache", which uses the exact video format that is active.)
* If the previous tests show that the problem is related to the particular media files (not to raw disk speed), copy the media files to another location and test with the new copy. An ingest to non-consecutive disk blocks is the most common reason for this issue. It can happen when copying files with non-specialised OS tools (using drag & drop with network drives or external drives is a highway to disaster). Fragmentation is also a potential reason, and tools like xfs_fsr or snfsdefrag may be required. Please see the section below, "FRAGMENTATION OF FILES AND FRAGMENTATION OF FREE SPACE: Why it happens, how to fix it and how to prevent it".
* If the problem is related to disk speed, before contacting SGO support make sure that the files are located in a storage unit provided by SGO. Third party units are never optimised for Mistika (whatever they say), even when using the same hardware, and our engineers will not be able to diagnose problems related to storage models that they do not have in SGO offices. The lack of at least a minimal SGO certified storage volume to test on has proven to be extremely time consuming for both sides when a problem arises.
- If the problem is not related to CPU, disk speed or media files, then it could be related to GPU usage. Remove all the effects and add them back one by one for a better analysis. Also remember that activating optical flow parameters can make a certain effect much slower than it was.
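The order of the media and disk tests above can be summarised as a decision function. This is only a sketch of our reading of the procedure: the three arguments are the outcomes of tests that you run by hand, and the wording of the returned hints is our own.

```python
def diagnose(effects_only_ok, media_only_ok, js_render_ok):
    """Return the likely area to investigate, following the tests in order.

    effects_only_ok -- realtime with media replaced by fast effects (Wipe)
    media_only_ok   -- realtime with all effects removed, media only
    js_render_ok    -- realtime with the media re-rendered to .js
    """
    if not effects_only_ok:
        # Even trivial effects fail, so storage and media are not the cause.
        return "not storage or media: check video board, CPU or GPU"
    if media_only_ok:
        # Media alone plays fine, so the cost is in the effects.
        return "effects load: check GPU usage, add effects one by one"
    if js_render_ok:
        return "media format issue"
    return "disk speed issue"
```

Note that js_render_ok only matters once the first two tests have pointed at the media or the disks, which is why it is checked last.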
Problem: I am still not convinced. The system is abnormally slow, at least at certain times. It was working much faster and more stably before!
If it works fast for a while and much slower from then on, it sounds like a GPU issue: when there is no more graphics memory it will start swapping with standard RAM, which has a huge impact on performance. Unlike temperature problems and other hardware issues, this is normally solved by restarting Mistika. If that is the case and it happens often, it means that you need a GPU with more graphics memory. Also, if you do not have a high end board with a lot of memory, try to avoid working with images of many different resolutions in the same session, as that will reach this point much earlier.
If the difference is huge and it does not depend on the project or particular session, the most probable reason is a hardware issue. In a case like this, check the following points:
If you have made hardware changes recently, review the slot positions of the boards. Some combinations can cause trouble, and some board models can degrade system performance.
From the previous tests you should at least already know whether the problem is related to disk speed. If that was the case:
* A known cause of "random" disk speed problems is dust in the fiber connectors (they are made of glass). Disconnect them and blow on the fiber connectors. Do not let the connectors touch any surface before plugging them in (especially the floor!)
* Stop all the disk array activity for a while and check the disk array lights when there is no activity. If you see a lot of activity in one group of disks, it indicates that there is a broken disk and the corresponding rebuild process is under way (it can take up to one day, and support needs to be contacted as soon as possible). If there is activity in more than one group of disks then you need to find what is causing it, because all activity was supposed to be stopped for this test. Maybe there is a forgotten copy process, unwanted network access from other computers, a scheduled defragmentation process, etc.
If disk speed, media file and CPU issues have been ruled out in previous tests, also check the GPU temperature during playback (use nvidia-settings for that). Dust accumulation in the fan filters or improper ventilation can create temperature issues with an impact on performance and stability.
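If you prefer to log the GPU temperature from a script rather than watch it in nvidia-settings, a sketch like the following may help. It swaps in the nvidia-smi command line tool that ships with the NVIDIA driver (an assumption about your installation), and the 85 degree limit is a placeholder, not a value recommended by SGO.

```python
import subprocess

def parse_gpu_temperatures(text):
    """Parse one temperature (degrees C) per line, one line per GPU."""
    return [int(line) for line in text.splitlines() if line.strip()]

def read_gpu_temperatures():
    # Assumes nvidia-smi is installed and on the PATH.
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=temperature.gpu",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return parse_gpu_temperatures(result.stdout)

def too_hot(temperatures, limit_c=85):
    # limit_c is a placeholder assumption; use the limit for your board.
    return [t for t in temperatures if t >= limit_c]
```

Calling too_hot(read_gpu_temperatures()) every few seconds during a playback gives a simple record of whether any GPU crosses the chosen limit.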
Insufficient power supply is also a known cause of performance issues. Remove all the unnecessary boards and hard disks and try a playback test in the simplest possible configuration.
Another well known issue that can have a high impact on performance or stability is a difference in earth levels or electrical phase between pieces of equipment (workstation, monitors (GUI and SDI), external hard disks, etc.). Try to organise a test with everything connected to the same power strip (only the equipment connected through optical fiber does not need to be taken into account, as these cables do not carry electricity).
If the problem only happens in live video: DVS boards receiving input signals that differ from the output format have been reported to cause performance problems. With AJA boards, if they cannot play back a simple clip (just a Wipe effect) it may indicate a broken unit.
If there is a UPS: bypass the UPS and use a reliable power plug. Most UPS units degrade over time and some models cannot provide all the power that is required, which especially affects GPU performance. Also run a test with all the monitors and all devices connected to the same power strip as the workstation. Earth level differences and other electrical issues are a common cause of GPU performance issues.
Some image formats can benefit from faster storage units, even when the nominal bandwidth of the storage unit seems to be superior to the format specs. In particular, tiff16 and exr ZIP/PIZ 4K require particularly fast storage devices, because of the way they need to access the media files (ask SGO support for more details).
FRAGMENTATION OF FILES AND FRAGMENTATION OF FREE SPACE: Why it happens, how to fix it and how to prevent it
A file is fragmented when it is not stored in consecutive disk blocks. In rotational drives, this forces the mechanical disk heads to jump constantly between different areas during playback, which is a slow process. It also destroys the efficiency of read-ahead caches.
It happens for two main reasons:
- Using the storage almost full over long periods of time: if you work at 90% or more all the time, as you delete files here and there you create free space that is not consecutive. New files are then forced to be split into small pieces wherever there is some space, even if they do not fit well.
The next point is much worse (and if you combine both, the situation can become pretty dramatic):
- Sequence interleaving: this happens when two or more processes are writing to the filesystem at the same time, for example two copy processes, or a copy process and a render. It is extremely dangerous when it is done with enumerated sequences. The files will not necessarily be fragmented initially, but both sequences will be interleaved on disk (especially if there is little space). At this point they will still play in realtime because, even if not consecutive, the frames are still close to each other. But later, when you delete one sequence, it will free a lot of small non-consecutive disk areas, each the size of one image frame. Then a new sequence is ingested and has to use that space: if the new frames are bigger they will need to be split across the available areas, and if they are smaller they will divide the free space even more. When you delete an interleaved sequence, what gets fragmented is the free space, which leads to extreme file fragmentation later.
- How to prevent it:
Never execute two write processes at the same time on the same filesystem. If you are forced to do that, or if it happened inadvertently, make sure that you delete the files from both of them together. And only work with an almost full filesystem for short periods of time.
Note: In the particular case of the latest MISTIKA-SAN storage volumes provided by SGO this is not a problem (our special setup can force simultaneous ingest or render processes to use separate areas for each one), but in all other cases (including XFS filesystems) it is a very bad idea.
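The interleaving effect can be illustrated with a toy model. This is a deliberately simplified sketch: the "disk" is a Python list, every frame occupies exactly one block, and the two writers alternate strictly, none of which reflects how a real filesystem allocates space.

```python
def free_runs(disk):
    """Return the lengths of the consecutive free-block runs on the toy disk."""
    runs, current = [], 0
    for block in disk:
        if block is None:
            current += 1
        elif current:
            runs.append(current)
            current = 0
    if current:
        runs.append(current)
    return runs

def write_interleaved(frames):
    """Two writers, A and B, alternate one single-block frame at a time."""
    disk = []
    for _ in range(frames):
        disk.extend(["A", "B"])
    return disk

def delete_sequence(disk, name):
    """Free every block that belongs to the named sequence."""
    return [None if block == name else block for block in disk]

disk = delete_sequence(write_interleaved(4), "A")
print(disk)             # [None, 'B', None, 'B', None, 'B', None, 'B']
print(free_runs(disk))  # [1, 1, 1, 1]
```

After deleting only sequence A, the free space consists of four isolated one-frame holes, so any new frame larger than one block would have to be split. Deleting B as well leaves a single run of eight free blocks, which is why the files from both writers should be deleted together.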
- How to detect it:
Render a .js file. If it does not play back in realtime when it should, execute this:
xfs_fsr -v PathToFile
It will report the extents that the file had, and it will defragment it if possible.
Note: When using xfs_fsr the file cannot be in use, or it will refuse to defragment it.
- How to fix it:
If you have an urgent job, try to defragment just that file as explained above.
To solve it for the whole filesystem:
If possible, make a backup and reformat it. This is the fastest and most efficient method.
If that is not possible:
1 - Make a lot of space, as much as possible.
2 - Run xfs_fsr against the whole filesystem. But do not do it until you have made space, or it will probably not be worth it and you will put a lot of stress on the mechanical disk heads, reducing their lifetime.
xfs_fsr -v PathToStorage
Normally each run lasts for up to two hours, and it gives you information about what happens with each file before and after. You can do more passes, but if the situation does not improve sufficiently then the recommended solution is to make a backup and reformat the filesystem (sometimes this is the fastest solution, because defragmenting is a slow iterative process).
3 - For the particular case of enumerated sequences: if a sequence does not play back in realtime after defragmentation, it can be because, even if the frames are not fragmented, each one has ended up in a different place.
Then, after doing the defragmentation, just copy the sequence folders to a different place (on the same filesystem). That will make a new version using unfragmented space (if it is available). But it is critical to copy the sequence in consecutive order, so please always follow this rule:
Never copy an enumerated sequence using a Drag & Drop action or the "cp" command. The operating system does not know what an "image sequence" is and it will not try to keep it together, nor will it necessarily use alphanumerical order. This applies both to the local system and to remote actions on network drives. Instead, use a specialised application that can follow the frame order, like a render process, an ftp application (sorted by filenames), the Mistika "mtransfer" application, or rsync on Linux (or grsync for a GUI).
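To make "following the frame order" concrete, here is a minimal Python sketch that copies a sequence folder one frame at a time, sorted by frame number. It is an illustration only, not a replacement for the tools listed above: the function names are hypothetical, the frame number is assumed to be the last run of digits in the file name, and writing in order does not by itself guarantee consecutive placement on disk.

```python
import re
import shutil
from pathlib import Path

def frame_key(path):
    """Sort key: the last run of digits in the file name, as a number."""
    numbers = re.findall(r"\d+", path.stem)
    return (int(numbers[-1]) if numbers else -1, path.name)

def copy_sequence_in_order(source_dir, target_dir):
    """Copy every file of source_dir one at a time, in frame order."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    frames = sorted(
        (p for p in Path(source_dir).iterdir() if p.is_file()), key=frame_key
    )
    for frame in frames:
        # Strictly sequential writes, one frame after another.
        shutil.copy2(frame, target / frame.name)
    return [frame.name for frame in frames]
```

Sorting by the numeric value matters when frame numbers are not zero padded: a plain alphabetical sort would place frame 10 before frame 2.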
In general, the same storage model can provide much more realtime performance if it is specifically optimised for Mistika. For this reason, storage units provided by SGO are in general at least twice as fast as the same model with the default settings used by other providers.
However, this is outside the scope of this document, as there are tens of settings to take into account and they change constantly. The optimal values have to be found experimentally, and sometimes new settings need to be fine tuned inside the Mistika software by our developers. So if you want truly fast storage, please contact SGO support services before acquiring your next storage upgrade.
ELECTRICAL AND TEMPERATURE PROBLEMS THAT CAN AFFECT REALTIME PERFORMANCE
If none of the above was conclusive, then check these points.
- Temperature problems in the GPU or CPU, and other hardware issues. This can happen due to inadequate ventilation, but also due to damaged hardware. In general, temperature issues can manifest themselves as system crashes, but also as very obvious performance problems without crashing.
* Use the nvidia-settings tool to check for temperature problems during heavy rendering. It should not go into the red zone, but the yellow zone is normal.
* Check the /var/log/messages file. Check for temperature issues on the CPUs, and also for other hardware warnings appearing during playbacks.
- Insufficient electrical power is also a known cause of both severe performance issues and crashes. SGO turnkey systems are tested to work with all components at 100%, but if you have added extra components or your workstation did not come from SGO then we recommend testing with reduced power usage. If all diagnostics have failed, then remove all the unnecessary boards, install a smaller GPU and test again.
If there is a UPS, do a test bypassing the UPS and using a reliable power plug. Most UPS units degrade over time and some models cannot provide all the power that is required, which is known to affect workstation performance.
- Electrical issues: Earth levels & phase differences. Another well known issue that can have a high impact on performance or stability is a difference in earth levels or a difference in electrical phase between pieces of equipment (workstation, monitors (GUI and SDI), external hard disks, etc.). Try to organise a test with everything connected to the same power strip (only the equipment connected through optical fiber does not need to be taken into account, as these cables do not carry electricity).