
One Version of Events

2 months 3 weeks ago

Jon supports some software that's been around long enough that the first versions of the software ran on, and I quote, "homegrown OS". They've long since migrated to Linux, and in the process much of their software remained the same. Many of the libraries that make up their application haven't been touched in decades. Because of this, they don't really think too much about how they version libraries; when they deploy they always deploy the file as mylib.so.1.0. Their RPM post-install scriptlet does an ldconfig after each deployment to get the symlinks updated.

For those not deep into Linux library management, a brief translation: shared libraries in Linux are .so files. ldconfig is a library manager, which finds the "correct" versions of the libraries you have installed and creates symbolic links to standard locations, so that applications which depend on those libraries can load them.
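As a concrete illustration, here's a sketch of the symlink layout ldconfig maintains, built by hand with hypothetical filenames (this is not ldconfig itself, just the shape of its output):

```python
# Sketch of the symlink chain ldconfig maintains, built by hand.
# The filenames here are hypothetical examples.
import os
import tempfile

d = tempfile.mkdtemp()
# The actual library file, with its full version number.
open(os.path.join(d, "libfoo.so.1.12"), "w").close()
# ldconfig creates the SONAME link: libfoo.so.1 -> libfoo.so.1.12
os.symlink("libfoo.so.1.12", os.path.join(d, "libfoo.so.1"))
# The development link (normally from a -devel package): libfoo.so -> libfoo.so.1
os.symlink("libfoo.so.1", os.path.join(d, "libfoo.so"))

# An application asking for libfoo.so resolves through the chain to the real file.
print(os.path.realpath(os.path.join(d, "libfoo.so")))
```

Loading through the top of the chain always lands on whatever the middle link currently points at, which is exactly why a wrong symlink target breaks applications that did nothing wrong themselves.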

In any case, Jon's team's solution worked until it didn't. They deployed a new version of the software, yum reported success, but the associated services refused to start. This was bad, because it happened in production. It didn't happen in test. They couldn't replicate it anywhere else, actually. So they built a new version of one of the impacted libraries, one with debug symbols enabled, and copied that over. They manually updated the symlinks, instead of using ldconfig, and launched the service.

The good news: it worked.

The bad news: it worked, but the only difference was that the library was built with debug symbols. The functionality was exactly the same.

Well, that was the only difference other than the symlink.

Fortunately, a "before" listing of the library files was captured before the debug version was installed, a standard practice for their site reliability engineers. They do this any time they try to debug in production, so that they can quickly revert to the previous state. And in this "before" listing, someone noticed that mylib.so was a symlink pointing to mylib.so.1.0.bkup_20190221.

Once again, creating a backup file is a standard practice for their SREs. Apparently, way back in 2019 someone was doing some debugging. They backed up the original library file, but never deleted the backup. And for some reason, ldconfig had been choosing the backup file when scanning for the "correct" version of libraries. Why?

Here, Jon does a lot of research for us. It turns out, if you start with the man pages, you don't get an answer, but you do get a warning:

ldconfig will look only at files that are named lib*.so* (for regular shared objects) or ld-*.so (for the dynamic loader itself). Other files will be ignored. Also, ldconfig expects a certain pattern to how the symbolic links are set up, like this example, where the middle file (libfoo.so.1 here) is the SONAME for the library:

libfoo.so -> libfoo.so.1 -> libfoo.so.1.12

Failure to follow this pattern may result in compatibility issues after an upgrade.

Well, they followed the pattern, and they found compatibility issues. But what exactly is going on here? Jon did the work of digging straight into the ldconfig source to find out the root cause.

The version-detecting algorithm starts by looking directly at filenames. While the man page warns about a convention, ldconfig doesn't validate names against this convention (which is probably the correct decision). Instead, to find which filename has the highest version number, it scans through two filenames until it finds numeric values in both of them, then does some pretty manual numeric parsing:

int
_dl_cache_libcmp (const char *p1, const char *p2)
{
  while (*p1 != '\0')
    {
      if (*p1 >= '0' && *p1 <= '9')
        {
          if (*p2 >= '0' && *p2 <= '9')
            {
              /* Must compare this numerically.  */
              int val1;
              int val2;

              val1 = *p1++ - '0';
              val2 = *p2++ - '0';
              while (*p1 >= '0' && *p1 <= '9')
                val1 = val1 * 10 + *p1++ - '0';
              while (*p2 >= '0' && *p2 <= '9')
                val2 = val2 * 10 + *p2++ - '0';
              if (val1 != val2)
                return val1 - val2;
            }
          else
            return 1;
        }
      else if (*p2 >= '0' && *p2 <= '9')
        return -1;
      else if (*p1 != *p2)
        return *p1 - *p2;
      else
        {
          ++p1;
          ++p2;
        }
    }
  return *p1 - *p2;
}

NB: this is the version of ldconfig at the time Jon submitted this, and the version that they're using. I haven't dug through to check if this is still true in the latest version. That's an exercise for the reader.

While we have not hit the end of the first string, check if the character in that string is numeric. If it is, check if the character in the second string is numeric. If it is, keep scanning through characters, and for as long as they're numeric, keep parsing them into numbers. If the numbers aren't the same, we return the difference between them.

If the first string has a number at this point but the second doesn't, return 1. If the second string has a number but the first doesn't, return -1. Otherwise, if the characters differ, return their difference; if they match, advance both pointers and continue. Finally, when we run off the end of the first string, return the difference between the two current characters: zero if the second string ended at the same point, negative if it still has characters left.
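To make the consequence concrete, here is a Python transliteration of that comparison (my own sketch, not glibc's code), showing why a stale backup file named mylib.so.1.0.bkup_20190221 sorts "newer" than mylib.so.1.0:

```python
# Python transliteration of glibc's _dl_cache_libcmp (a sketch; the real
# implementation is the C shown above). Strings get an explicit '\0'
# sentinel so the index walk matches the C pointer walk.
def libcmp(s1: str, s2: str) -> int:
    p1, p2 = s1 + "\0", s2 + "\0"
    i = j = 0
    while p1[i] != "\0":
        if "0" <= p1[i] <= "9":
            if "0" <= p2[j] <= "9":
                # Must compare this numerically.
                val1 = val2 = 0
                while "0" <= p1[i] <= "9":
                    val1 = val1 * 10 + int(p1[i]); i += 1
                while "0" <= p2[j] <= "9":
                    val2 = val2 * 10 + int(p2[j]); j += 1
                if val1 != val2:
                    return val1 - val2
            else:
                return 1
        elif "0" <= p2[j] <= "9":
            return -1
        elif p1[i] != p2[j]:
            return ord(p1[i]) - ord(p2[j])
        else:
            i += 1; j += 1
    return ord(p1[i]) - ord(p2[j])

# Numeric runs compare as numbers, so .12 correctly beats .2 ...
print(libcmp("libfoo.so.1.12", "libfoo.so.1.2"))             # positive
# ... but any suffix after a matching version prefix also wins, so the
# 2019 backup file outranks the real library:
print(libcmp("mylib.so.1.0.bkup_20190221", "mylib.so.1.0"))  # positive
```

The second comparison walks past the matching "mylib.so.1.0" prefix, then compares '.' against the end of the shorter string, so the longer backup name wins.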

Also, correct me if I'm wrong, but it seems like a malicious set of filenames could cause buffer overruns here.

Now, I'll be honest, I don't have the fortitude to suggest that ldconfig is TRWTF here. It's a venerable piece of software that's solving an extremely hard problem. But boy, DLL Hell is an unending struggle and this particular solution certainly isn't helping. I'm honestly not entirely certain I'd say that there was a true WTF here, just an unfortunate confluence of people doing their best and ending up laying landmines for others.

But here's the fun conclusion: the 2019 version of the library actually had been updated. They'd deployed several new versions between 2019 and 2024, when things finally blew up. The actual deployed software kept using the backup file from 2019, and while it may have caused hard-to-notice and harder-to-diagnose bugs, it didn't cause any crashes until 2024.

Remy Porter

CodeSOD: Invalid Passport

2 months 3 weeks ago

Gretchen wanted to, in development, disable password authentication. Just for a minute, while she was testing things. That's when she found this approach to handling authentication.

passport.authenticate('local', { session: true }, async (err, user) => {
    if (err) {
        res.send({ success: false, message: 'Error authenticating user.' })
    } else if (!user) {
        User.query()
            .where({ username: req.body.username })
            .first()
            .then(targetUser => {
                if (targetUser) {
                    const hash = User.hashPassword(
                        targetUser.password_salt,
                        req.body.password
                    )
                    if (hash === targetUser.password_hash) {
                        res.send({
                            success: false,
                            message: 'Incorrect username or password.',
                        })
                    } else {
                        res.send({
                            success: false,
                            message: 'Incorrect username or password.',
                        })
                    }
                } else {
                    res.send({
                        success: false,
                        message: 'Incorrect username or password.',
                    })
                }
            })
            .catch(err => {
                res.send({ success: false, message: 'Internal server error' })
            })
    } else if (user.firstLogin) {
        //......
    }
})(req, res, next);

passport.authenticate invokes its callback after attempting to authenticate. Now, normally, this is called as middleware on a route defined on the webserver; that is to say, you don't call it from within your application code, but as part of your routing configuration. That's not the case here, where this blob lives inside a controller.

That's weird, but let's just trace through this code. We attempt to authenticate. When the process completes, our callback function executes. If authentication failed, there's an error, so we'll send an error message. Then, if the user object isn't populated, we attempt to look up the user. If we find a user with that user name, we then hash their password and check if the hash matches. If it does, we send an error message. If it doesn't, we send an error message. If we didn't find the user, we send an error message. If anything goes wrong, we send an error message.

Wait a second, back up: if the user exists and their password matches, we send an error message?

I'll let Gretchen explain a bit more:

passport.authenticate returns an error if the authentication failed and a user object, if it succeeded. We check this immediately: if error is set, return an error message. But then, we check if the user does not exist (aka: the authentication failed).

Yes, the reason user would be null is because the authentication failed. So the error would be set. So that entire branch about validating the user won't happen: either the authentication worked and we know who the user is, or it failed, in which case we'd have an error. There's no reasonable world where there's no error but also no user object.

So yes, if authentication failed, but you manually re-run the authentication and it succeeds for some reason, yeah, you probably should still return an error. But I don't know if it's "Incorrect username or password". It probably should be "Invalid reality, please restart the universe and see if the problem persists."

Remy Porter

Error'd: When All You Have is a Nail

2 months 3 weeks ago

...everything looks like a hammer.

"Where is this plane?" wondered erffrfez (hope I spelled that right), explaining "I was on a flight across Aus, and noticed that the back of seat display doesn't seem to know exactly where the plane is. There are two places where 'distance to destination' is displayed. They never matched and the difference varied through the flight." I have a suspicion this is related to the January 20 WTF.


Roadtripping Adam R. rued "I'm studying the time tables for the buses in the Italian alps, and I was a little dismayed to find that 19 hours will have elapsed between the 11:05 departure and 12:05 arrival. "


Also Adam R. in the airport shared "After my inbound flight to MAD was cancelled, AA couldn't decide if my connecting flight was arriving or departing, so it decided that it was undefineding."


Meanwhile B.J.H. reported "Earlier the monitor had accurate information about a departed flight, but they were able to fix that by bringing in a technician to get the display ready for the flight to Godot."


Also B.J.H. pleaded "Won't someone help LATAM airlines find the missing key so we can leave." Apparently they found it eventually.


Lyle Seaman

CodeSOD: Brillant Python Programmers

3 months ago

Sandra from InitAg (previously) tries to keep the team's code quality up. The team she's on uses CI, code reviews, linting and type checking, and most important: hiring qualified people. Overall, the team's been successful recently. Recently.

The company got its start doing data-science, which meant much of the initial code was written by brilliant PhDs who didn't know the first thing about writing software. Most of that code has been retired, but it is impossible to dispatch all of it.

Which brings us to Stan. Stan was a one-man dev-team/sysadmin for a mission critical piece of software. No one else worked on it, no one else looked at it, but that was "okay" because Stan was happy to work evenings and weekends without anyone even suggesting it. Stan loved his work, perhaps a little too much. And as brilliant as Stan was, when it came to software engineering, he was at best "brillant".

Which brings us to a file called utils/file_io.py. As you might gather from the full name there, this is a "utility" module stuffed with "file and IO" related functions. Let's look at a few:

def del_rw(action, name, exc):
    """
    See https://stackoverflow.com/questions/21261132/shutil-rmtree-to-remove-readonly-files
    Is used by shutil.rmtree to remove read only files as well
    """
    os.chmod(name, stat.S_IWRITE)
    os.remove(name)

This, I think, is a case of "maybe this shouldn't be a function?" Given that it's only two lines, it's more clear what you're doing if you just include the lines. Also, it takes three parameters and uses one of them.

def safe_rmtree(remove_dir, minimum_parent_dir, min_parent_parts=2):
    """
    Recursively removes all files in a directory that should contain the min parent dir
    :param remove_dir:
    :param minimum_parent_dir:
    :param min_parent_parts: nr of folders that the parent should contain to avoid removing eg C:
    :return:
    """
    if not os.path.exists(remove_dir):
        print(f"WARNING in safe_rmtree: {remove_dir} does not exist, nothing is removed")
        return
    remove_path = pathlib.Path(remove_dir)
    minimum_parent_path = pathlib.Path(minimum_parent_dir)
    if not minimum_parent_path.is_dir():
        raise AssertionError("Min parent dir not a valid dir {}".format(minimum_parent_dir))
    if len(minimum_parent_path.parts) <= min_parent_parts:
        raise AssertionError(
            "Unsafe removal of {}: minimum parent dir is too short {}".format(
                remove_dir, minimum_parent_dir
            )
        )
    if minimum_parent_path in remove_path.parents:
        print("REMOVE {}".format(remove_dir))
        shutil.rmtree(remove_dir, onerror=del_rw)  # use del_rw to also remove read only files
    else:
        raise AssertionError(
            "Unsafe removal of {}: does not contain minimum parent dir {}".format(
                remove_dir, minimum_parent_dir
            )
        )
    time.sleep(0.1)

I definitely don't like this. We do at least use all of our parameters. While the documentation doesn't really make it clear what they all do, this will remove a directory if and only if the directory is contained under minimum_parent_dir, and the minimum_parent_dir is at least min_parent_parts levels deep.
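The core containment check does at least lean on pathlib correctly: Path.parents holds every ancestor of a path, so membership tests whether one directory sits under another. A quick sketch with made-up paths:

```python
# Demonstrating the pathlib containment check safe_rmtree relies on.
# PurePosixPath so this behaves identically on any OS; the paths are made up.
from pathlib import PurePosixPath

target = PurePosixPath("/data/project/tmp/run1")
parent = PurePosixPath("/data/project")

print(parent in target.parents)                    # True: run1 lives under /data/project
print(PurePosixPath("/other") in target.parents)   # False
```

Note that this is a pure path comparison: it knows nothing about symlinks, which is arguably relevant given how this whole story started.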

I understand, broadly, why you want some checks around a potentially destructive operation, but I have to wonder about why this is the set of checks we added.

Also, why the tenth of a second sleep at the end? shutil isn't doing things on background threads, it just manipulates the file-system via syscalls.

Now, what's notable about this one is that we're using the pathlib.Path API for interacting with file paths. This is the correct way to do things in Python, which is why this method is just funny:

def file_name_from_path(path, extension=False):
    """
    Get the name of a file (without the extension) given a path
    :param path:
    :param extension: default false, but if true fn will include extension
    :return: a string with the file name (optional extension)
    """
    basename = ntpath.basename(path)
    if extension is True:
        return basename
    else:
        return os.path.splitext(basename)[0]

This reinvents methods that are already in pathlib.Path. I also like that the documentation contradicts itself: this returns the name of the file (without the extension) except when it also returns the extension.
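For reference, pathlib already covers both behaviors directly (the path below is just an example):

```python
# pathlib's built-in equivalents of file_name_from_path.
from pathlib import PurePath

p = PurePath("/data/results/report.csv")
print(p.name)    # 'report.csv' -- with extension
print(p.stem)    # 'report'     -- without extension
print(p.suffix)  # '.csv'
```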

def create_dir(new_dir, remove_data=False, verbose=False):
    """
    Create a new directory.
    NOTE: this function only creates one new folder, the root folder must already exist.
    :param new_dir: directory to create
    :param remove_data: if yes, remove existing data in folder.
    :param verbose:
    :return: a string containing the new directory
    """
    if remove_data and os.path.exists(new_dir):
        shutil.rmtree(new_dir)
        time.sleep(0.1)
    if os.path.exists(new_dir):
        if verbose:
            print("Output directory {} already exists".format(new_dir))
    else:
        os.makedirs(new_dir)
    return new_dir

This function creates a new directory and possibly deletes the old one. Not "safely", though, but hey, that random sleep is back.
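For comparison, the standard library already has an idempotent version of "create it unless it exists", no sleep required (a sketch using a temp directory):

```python
# The one-liner that replaces the "check, maybe print, maybe create" dance.
import os
import tempfile

base = tempfile.mkdtemp()
new_dir = os.path.join(base, "output", "run1")

# Parents are created as needed; calling this twice is harmless.
os.makedirs(new_dir, exist_ok=True)
os.makedirs(new_dir, exist_ok=True)
print(os.path.isdir(new_dir))  # True
```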

def dirs_in_dir(data_dir, contains=""):
    """
    Get the paths of all directories in a directory containing a set of characters, the basename
    :param data_dir: directory for which to list all paths with a base name
    :param contains: characters that should be in the name of the directory
    :return: a list of all directory paths
    """
    if not os.path.exists(data_dir):
        print("Warning in get_dirs_in_dir: directory {} does not exist".format(data_dir))
        return
    dir_paths = [
        os.path.join(data_dir, d)
        for d in os.listdir(data_dir)
        if (os.path.isdir(os.path.join(data_dir, d)) and contains.lower() in d.lower())
    ]
    return dir_paths

This is another reinvention of functionality that already exists in pathlib.Path. I'd say, "And they already know about that!" but they clearly don't; they copied code off StackOverflow and didn't read it.
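With pathlib, the whole function collapses to a comprehension. A sketch, preserving the case-insensitive name matching, against a throwaway directory tree:

```python
# pathlib version of dirs_in_dir, exercised against a temp directory tree.
from pathlib import Path
import tempfile

base = Path(tempfile.mkdtemp())
(base / "Run_01").mkdir()
(base / "run_02").mkdir()
(base / "misc").mkdir()
(base / "notes.txt").touch()  # a file, to prove it gets filtered out

contains = "RUN"
dir_paths = [
    p for p in base.iterdir()
    if p.is_dir() and contains.lower() in p.name.lower()
]
print(sorted(p.name for p in dir_paths))  # ['Run_01', 'run_02']
```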

How about writing to a log file in possibly the slowest way possible?

def write_to_log(text_string, file_path, level="INFO", code="I_NOTDEFINED", entry="PROJ"):
    print(f"Log: {text_string}", flush=True)
    if not os.path.exists(file_path):
        file = open(file_path, "w")
        file.close()
    lookup_dict = error_lookup_dict
    responsible = lookup_dict[code]["responsible"]
    with open(file_path, "a") as file:
        dt_string = datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S")
        log_text = "{}; {}; {}; {}; {}; {}".format(
            dt_string, entry, level, responsible, code, text_string + "\n"
        )
        file.write(log_text)

Most logging frameworks open a file when you start, and keep the handle open until you're done logging. This one not only checks if it exists, but reopens it again and again for every write. And while it's at it, it also prints what it's logging. And Python has a built-in logging framework that definitely handles all these features.
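For contrast, the built-in logging module opens the file once and formats every record through one handler. A minimal sketch (the logger name and format string are my own, loosely mirroring the fields above):

```python
# Minimal file logging with the standard library: one handler, opened once.
import logging
import tempfile

log_file = tempfile.NamedTemporaryFile(suffix=".log", delete=False).name

logger = logging.getLogger("proj")
handler = logging.FileHandler(log_file)
handler.setFormatter(
    logging.Formatter("%(asctime)s; %(name)s; %(levelname)s; %(message)s")
)
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("service started")
logger.warning("disk almost full")

handler.flush()
print(open(log_file).read())
```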

Finally, what file_io library would be complete without a way to call shell commands?

def run_command_window(cmd, log_path=None):
    std_output = []
    updated_env = {**os.environ, "PYTHONPATH": str(config.repos_dir)}
    with Popen(
        cmd,
        stdout=PIPE,
        bufsize=1,
        universal_newlines=True,
        shell=sys.platform == "linux",
        env=updated_env,
    ) as p:
        for line in p.stdout:
            print(line, end="", flush=True)  # process line here
            std_output.append(line)
            if log_path:
                with open(log_path, "a") as log:
                    log.write(line)
    return p.returncode, std_output

Broadly speaking, if you're trying to automate things via Python, you might come up with something very much like this. But I draw your attention to their updated_env which populates a PYTHONPATH variable: they're calling Python code by shelling out and launching a new interpreter.
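If you genuinely do need to run an external command, subprocess.run captures output with far less ceremony. A sketch (in their case, of course, the better fix is probably to import the Python code instead of launching a second interpreter at all):

```python
# Running a child process and capturing its output with subprocess.run.
# The command here is just a demo; sys.executable guarantees the same interpreter.
import subprocess
import sys

result = subprocess.run(
    [sys.executable, "-c", "print('hello from the child')"],
    capture_output=True,
    text=True,
    check=True,  # raise CalledProcessError on a nonzero exit code
)
print(result.returncode)      # 0
print(result.stdout.strip())  # hello from the child
```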

Sandra writes:

This is one of the better files in the project.
Thankfully, this one has a happy ending: management is well aware that this project is legacy garbage, and has authorized a full-scale rework to make it maintainable. Whether the guy doing the work escapes with his sanity intact is another question.

Honestly, as "I don't actually know Python, but I'm very smart" code goes, this isn't so bad. I mean, it's terrible, but it's clearly written. Even the parts that shouldn't have been written are clean and easy to understand. That, I suppose, just makes it worse, doesn't it? Picking through the code that needs to be thrown away will be harder than one expects, because you can't just go "kill it with fire"; you have to think about what you're throwing away.

Remy Porter