Tech Tidbits
Slow I/O Due to Large Files Filling Up the Page Cache
I came across an issue where our builds were taking a very long time. atop, a very handy tool, showed a lot of I/O activity, and the main culprit was that reading a large file took an incredibly long time. It turned out that the kernel was loading these huge files into the page cache and failing to evict them once they were no longer needed. As a result, when another file was read, the kernel had to evict the previous file's pages first instead of reading the new file straight into the cache. Watching free -mh made this fairly obvious: I could reproduce the issue consistently whenever the disk cache was very close to the size of memory. Once I freed up the disk cache manually with the command sync; echo 1 > /proc/sys/vm/drop_caches, disk I/O sped up considerably. That felt like a hacky solution, though. The better fix is to tell the kernel that the file won't be needed again and can be safely evicted from the cache, which can be done with the Linux API fadvise (posix_fadvise with the POSIX_FADV_DONTNEED advice). So remember, if you're loading a bunch of huge files on Linux, the kernel will try to keep them in the cache (if the memory is available) since it thinks they might be used later. If you know for sure they won't be, then let the kernel know! There are also other nifty flags you can use to tell the kernel about the file read (like whether it's a sequential read or a random read). You can see more here: https://linux.die.net/man/2/fadvise.
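Here is a minimal sketch of what that can look like, assuming Python on Linux, where the fadvise call is exposed as os.posix_fadvise. The names read_once and _advise and the 1 MiB chunk size are made up for illustration; this is not the code from our build system.

```python
import os


def _advise(fd, advice_name):
    """Pass a hint to the kernel; do nothing where posix_fadvise is unavailable."""
    advice = getattr(os, advice_name, None)
    if hasattr(os, "posix_fadvise") and advice is not None:
        # offset 0 with length 0 means "from the start to the end of the file"
        os.posix_fadvise(fd, 0, 0, advice)


def read_once(path, chunk_size=1 << 20):
    """Read a file front to back, then tell the kernel its pages can go.

    Returns the number of bytes read. (Hypothetical helper for this sketch.)
    """
    total = 0
    fd = os.open(path, os.O_RDONLY)
    try:
        # Hint that we will read sequentially.
        _advise(fd, "POSIX_FADV_SEQUENTIAL")
        while True:
            chunk = os.read(fd, chunk_size)
            if not chunk:
                break
            total += len(chunk)  # real code would process the chunk here
        # We are done with this file: its cached pages may be evicted.
        _advise(fd, "POSIX_FADV_DONTNEED")
    finally:
        os.close(fd)
    return total
```

Calling read_once("/path/to/huge.file") reads the file and then hands its pages back. One caveat worth knowing: POSIX_FADV_DONTNEED only drops clean pages, so for a file you have just written you would want to flush it first (for example with os.fsync).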
Unresponsive Web Service Due to Insufficient Number of Per-Process File Descriptors
An issue I came across was a web service that wasn't responsive. It stopped working once we deployed to a fresh new machine, which helped narrow it down to an issue with the OS setup and not the binary (it was working just fine on the old machine). The only problem was that I had no idea what the heck was happening! Thankfully, I broke out the handy tool strace to trace all of the system calls this process was making. From there, I could see the error Too many open files whenever the service tried to open a socket connection. I was still scratching my head because I had set the system-wide file descriptor limit to the max it could be. However, it turns out that there is also a per-process file descriptor maximum that I had failed to update. Once I updated that value, the server resumed and voila, problem fixed!
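For reference, the per-process limit is the RLIMIT_NOFILE resource limit (the number that ulimit -n shows). Below is a rough sketch of checking and raising it from inside a process, assuming Python's resource module on Linux. The function name raise_nofile_soft_limit is hypothetical, and this is not necessarily how the limit was changed on our server.

```python
import resource


def raise_nofile_soft_limit(wanted):
    """Raise this process's soft open-file limit toward `wanted`.

    The soft limit is never lowered and never pushed past the hard limit.
    Returns the (soft, hard) pair in effect afterwards.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    unlimited = resource.RLIM_INFINITY
    if soft == unlimited:
        # Nothing to raise.
        return soft, hard
    # An unprivileged process may only raise the soft limit up to the hard one.
    target = wanted if hard == unlimited else min(wanted, hard)
    if target > soft:
        resource.setrlimit(resource.RLIMIT_NOFILE, (target, hard))
    return resource.getrlimit(resource.RLIMIT_NOFILE)
```

Something like raise_nofile_soft_limit(65536) at startup would lift the soft limit as far as the hard limit allows (65536 is just an example value). Raising the hard limit itself needs to happen outside the process, in whatever launches the service.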
Slow Bootup? Maybe Don't Try to Speed it Up with a Memory Mapped Solution
One problem I came across was that some of our web services had ridiculous response times. Responses normally took under a second, but some were taking two minutes! Ridiculous! I had no idea why these outliers existed, or why they were so far outside our usual response time. In grafana, I could see that the system load was very high and a large number of processes were waiting on I/O. It turns out that, in an effort to speed up the boot up time of this service, a recent change had been introduced to load a large file as a memory mapped file instead of simply loading it into memory. It certainly did speed up bootup, but it also meant the file wasn't completely loaded into the disk cache, so there were a lot of page faults initially that resulted in expensive disk reads. I had two options: revert to the old way of just loading the file into memory, or keep the memory mapped file and use the MAP_POPULATE flag, which pre-loads pages into the cache before they are needed. I found the memory mapped file with the MAP_POPULATE flag to be a big improvement in response times, but it still didn't compare to just loading the file into memory. I decided to revert to loading the file into memory and just deal with the slow bootup (i.e., increased service downtime at bootup) in exchange for fast response times. A little bit of downtime wasn't a big deal because ultimately there was enough redundancy in our stack to pick up the slack.
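To make the MAP_POPULATE option concrete, here is a small sketch assuming Python 3.10 or newer on Linux, where the flag is available as mmap.MAP_POPULATE. The helper name map_prefaulted is made up, and our service was not necessarily written this way.

```python
import mmap


def map_prefaulted(path):
    """Map a non-empty file read-only and ask the kernel to pre-load its pages.

    Falls back to a plain lazy mapping if this Python build has no
    MAP_POPULATE constant. (Hypothetical helper for this sketch.)
    """
    populate = getattr(mmap, "MAP_POPULATE", 0)
    with open(path, "rb") as f:
        # Length 0 maps the whole file. The mapping stays valid after the
        # file object is closed because mmap keeps its own descriptor.
        return mmap.mmap(
            f.fileno(),
            0,
            flags=mmap.MAP_PRIVATE | populate,
            prot=mmap.PROT_READ,
        )
```

With the flag set, the cost of reading the file moves from the first requests back to the mmap call itself, which is the same tradeoff described above: slower startup in exchange for fewer page faults later.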
Repurposing a Server? Make Sure to Update the RAID Setup to Suit the New Task!
I know this sounds super obvious, but we often repurposed servers for different tasks and forgot to switch the RAID setup even when the nature of the task was quite different. I believe these servers initially held data that needed redundancy, so RAID5 was set up. We then turned these machines into build machines where redundancy wasn't needed at all, and we just needed to get the builds running as quickly as possible. We switched to RAID0 (striping, no redundancy) and got double the space (which was needed for our ever-growing builds) and faster builds! So don't be silly like us, and make sure you change the RAID setup to suit the task!
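How much space a switch like this frees up depends on the RAID level you start from and the number of disks. As a back-of-the-envelope sketch (assuming equal-sized disks; the function name and the disk counts and sizes in the usage note are made up, not our actual hardware):

```python
def usable_capacity(level, disks, disk_size_tb):
    """Usable capacity in TB for an array of equal-sized disks."""
    if level == "raid0":
        minimum, data_disks = 2, disks        # striping, nothing held back
    elif level == "raid5":
        minimum, data_disks = 3, disks - 1    # one disk's worth of parity
    elif level == "raid10":
        if disks % 2:
            raise ValueError("raid10 needs an even number of disks")
        minimum, data_disks = 4, disks // 2   # every disk is mirrored
    else:
        raise ValueError(f"unsupported level: {level}")
    if disks < minimum:
        raise ValueError(f"{level} needs at least {minimum} disks")
    return data_disks * disk_size_tb
```

For example, with four 2 TB disks, usable_capacity("raid5", 4, 2) gives 6, usable_capacity("raid10", 4, 2) gives 4, and usable_capacity("raid0", 4, 2) gives 8.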
Book Recommendation: The Linux Programming Interface
A good friend of mine recommended the book The Linux Programming Interface by Michael Kerrisk. It has been an invaluable resource for understanding these performance-related issues within our stack. I especially like the appendices that cover how to debug and strace a process. It has helped me become a better developer, as I now have a better grasp of Linux fundamentals. I highly recommend it!
Stress Testing Memcache - Adjust Local Port Range?
I came across an issue where I was getting an error when stress testing my memcache instance. Normally you'd create a cluster, which is what was intended for production, but since this was my test environment, there was only one instance. It seems that changing the local port range (/proc/sys/net/ipv4/ip_local_port_range) made quite a difference. That would make sense if the stress test was running out of local ports, since every client connection to the same instance needs its own port from that range.
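If you want to see how many local ports a client machine has to work with, a quick sketch like the following reads the same file (Python on Linux assumed; the function names are made up for illustration):

```python
from pathlib import Path

PORT_RANGE_FILE = Path("/proc/sys/net/ipv4/ip_local_port_range")


def parse_port_range(text):
    """Parse 'low high' and return (low, high, number_of_ports)."""
    low, high = (int(field) for field in text.split())
    return low, high, high - low + 1


def current_port_range():
    """Read the range the kernel currently uses for outgoing connections."""
    return parse_port_range(PORT_RANGE_FILE.read_text())
```

The file holds two numbers, the lowest and highest port the kernel will hand out for outgoing connections. Widening the range means writing two new numbers to that file as root; which values are appropriate depends on what else is listening on the machine.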
Book Recommendation: Code Complete, 2nd Edition by Steve McConnell
I received this book as an intern at a company I worked for early in my career. It took me about 13 years to finally pick it up and read it cover to cover. An old boss of mine emphasized the importance of always sharpening your skills by learning on the job, and that you can't cut a tree with a dull blade. It's true, and it took me years to understand this. When I read this book, there was one chapter in particular I wish I had read at the start of my career, when I was deeply insecure about my abilities as a developer (an insecurity that continues to this day, though I'm a lot more self-assured about it). The chapter is on Personal Character. The idea is that software development, like most things, requires time and dedication to your craft, and it isn't all about having a natural innate ability from birth. Maybe for some that's true, but certainly not for me. I had to learn and unlearn many things, and I'm still learning. I really appreciated this chapter and honestly wish I could go back in time and read it sooner, since I spent a lot of years wondering if tech was right for me. Judging by my interest in computers at a young age and how much I enjoy solving tech problems, I think I'm in the right field. It just requires a lot of reading and learning, and striving to be better than I was yesterday.