How do you correctly deserialize a byte array back into an object in C++?

2022-01-25 00:00:00 arrays deserialization embedded c++

My team has been having this issue for a few weeks now, and we're a bit stumped. Kindness and knowledge would be gratefully received!

Working with an embedded system, we are attempting to serialize an object, send it through a Linux socket, receive it in another process, and deserialize it back into the original object. We have the following deserialization function:

/*! Takes a byte array and populates the object's data members */
std::shared_ptr<Foo> Foo::unmarshal(uint8_t *serialized, uint32_t size)
{
  auto msg = reinterpret_cast<Foo *>(serialized);
  return std::shared_ptr<ChildOfFoo>(
        reinterpret_cast<ChildOfFoo *>(serialized));
}

The object is successfully deserialized and can be read from. However, when the destructor for the returned std::shared_ptr<Foo> is called, the program segfaults. Valgrind gives the following output:

==1664== Process terminating with default action of signal 11 (SIGSEGV)
==1664==  Bad permissions for mapped region at address 0xFFFF603800003C88
==1664==    at 0xFFFF603800003C88: ???
==1664==    by 0x42C7C3: std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release() (shared_ptr_base.h:149)
==1664==    by 0x42BC00: std::__shared_count<(__gnu_cxx::_Lock_policy)2>::~__shared_count() (shared_ptr_base.h:666)
==1664==    by 0x435999: std::__shared_ptr<ChildOfFoo, (__gnu_cxx::_Lock_policy)2>::~__shared_ptr() (shared_ptr_base.h:914)
==1664==    by 0x4359B3: std::shared_ptr<ChildOfFoo>::~shared_ptr() (shared_ptr.h:93)

We're open to any suggestions at all! Thank you for your time :)

Recommended answer

In general, this won't work:

auto msg = reinterpret_cast<Foo *>(serialized);

You can't just take an arbitrary array of bytes and pretend it's a valid C++ object (even if reinterpret_cast<> allows you to compile code that attempts to do so). For one thing, any C++ object that contains at least one virtual method will contain a vtable pointer, which points to the virtual-methods table for that object's class, and is used whenever a virtual method is called. But if you serialize that pointer on computer A, then send it across the network and deserialize and then try to use the reconstituted object on computer B, you'll invoke undefined behavior because there is no guarantee that that class's vtable will exist at the same memory location on computer B that it did on computer A. Also, any class that does any kind of dynamic memory allocation (e.g. any string class or container class) will contain pointers to other objects that it allocated, and that will lead you to the same invalid-pointer problem.
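As a rough illustration of the two problem cases above, the standard std::is_trivially_copyable trait can flag, at compile time, types that carry a vtable pointer or own heap memory. The three structs and the helper bytes_are_copyable() are hypothetical names invented for this sketch; they are not part of the question's code.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <type_traits>

// Hypothetical message types, used only for illustration.
struct PlainHeader {            // only fixed-width integers
    uint32_t id;
    uint16_t length;
};

struct WithVirtual {            // virtual destructor, so it has a vtable pointer
    virtual ~WithVirtual() = default;
    uint32_t id;
};

struct WithString {             // std::string holds a pointer to memory it owns
    std::string name;
};

// True only for types whose object representation may be copied with memcpy.
template <class T>
constexpr bool bytes_are_copyable() {
    return std::is_trivially_copyable<T>::value;
}

static_assert(bytes_are_copyable<PlainHeader>(), "plain integers can be byte-copied");
static_assert(!bytes_are_copyable<WithVirtual>(), "vtable pointer rules out byte copies");
static_assert(!bytes_are_copyable<WithString>(), "owned heap memory rules out byte copies");
```

Passing this check only rules out the pointer problems; it says nothing about whether the receiver lays the bytes out the same way.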

But let's say you've limited your serialization to POD (plain old data) objects that contain no pointers. Will it work then? The answer is: possibly, in very specific cases, but it will be very fragile. The reason is that the compiler is free to lay out the class's member variables in memory in different ways, and it will insert padding differently on different hardware (or sometimes even with different optimization settings), leading to a situation where the bytes that represent a particular Foo object on computer A are different from the bytes that would represent that same object on computer B. On top of that, you may have to worry about different word lengths on different computers (e.g. long is 32-bit on some architectures and 64-bit on others), and different endianness (e.g. Intel CPUs represent values in little-endian form while PowerPC CPUs typically represent them in big-endian). Any one of these differences will cause your receiving computer to misinterpret the bytes it received and thereby corrupt your data badly.
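To make the padding and byte-order concerns concrete, here is a small sketch. The struct Sample and the helpers put_be32() and get_be32() are hypothetical names made up for this example. The helpers build the byte sequence with shifts, so the result does not depend on the host's byte order or on how the compiler lays out any struct.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical struct: compilers commonly insert padding after `flag` so that
// `value` is aligned, which makes sizeof(Sample) larger than the 5 bytes of data.
struct Sample {
    uint8_t  flag;
    uint32_t value;
};

// Sum of the member sizes, i.e. the size the data would have with no padding.
constexpr std::size_t kPackedSize = sizeof(uint8_t) + sizeof(uint32_t);

// Writes `v` as four bytes, most significant byte first, on any host.
inline void put_be32(uint8_t *out, uint32_t v) {
    assert(out != nullptr);
    out[0] = static_cast<uint8_t>(v >> 24);
    out[1] = static_cast<uint8_t>(v >> 16);
    out[2] = static_cast<uint8_t>(v >> 8);
    out[3] = static_cast<uint8_t>(v);
}

// Reads four bytes written by put_be32() back into a host-order value.
inline uint32_t get_be32(const uint8_t *in) {
    assert(in != nullptr);
    return (static_cast<uint32_t>(in[0]) << 24) |
           (static_cast<uint32_t>(in[1]) << 16) |
           (static_cast<uint32_t>(in[2]) << 8)  |
            static_cast<uint32_t>(in[3]);
}
```

Comparing sizeof(Sample) with kPackedSize on your own targets shows how much padding each compiler adds; the two numbers are only guaranteed to satisfy sizeof(Sample) >= kPackedSize.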

So the remaining part of the question is, what is the proper way to serialize/deserialize a C++ object? And the answer is: you have to do it the hard way, by writing a routine for each class that does the serialization member-variable by member-variable, taking the class's particular semantics into account. For example, here are some methods that you might have your serializable classes define:

// Serialize this object's state out into (buffer)
// (buffer) must point to at least FlattenedSize() bytes of writeable space
void Flatten(uint8_t *buffer) const;

// Return the number of bytes this object will require to serialize
size_t FlattenedSize() const;

// Set this object's state from the bytes in (buffer)
// Returns true on success, or false on failure
bool Unflatten(const uint8_t *buffer, size_t size);

... and here's an example of a simple x/y point class that implements the methods:

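#include <arpa/inet.h>  // htonl(), ntohl()
#include <cstddef>      // size_t
#include <cstdint>      // int32_t, uint8_t
#include <cstring>      // memcpy()

class Point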
{
public:
    Point() : m_x(0), m_y(0) {/* empty */}
    Point(int32_t x, int32_t y) : m_x(x), m_y(y) {/* empty */}

    void Flatten(uint8_t *buffer) const
    {
       const int32_t beX = htonl(m_x);
       memcpy(buffer, &beX, sizeof(beX));
       buffer += sizeof(beX);
       
       const int32_t beY = htonl(m_y);
       memcpy(buffer, &beY, sizeof(beY));
    }

    size_t FlattenedSize() const {return sizeof(m_x) + sizeof(m_y);}

    bool Unflatten(const uint8_t *buffer, size_t size)
    {
       if (size < FlattenedSize()) return false;

       int32_t beX;
       memcpy(&beX, buffer, sizeof(beX));
       m_x = ntohl(beX);

       buffer += sizeof(beX);
       int32_t beY;
       memcpy(&beY, buffer, sizeof(beY));
       m_y = ntohl(beY);

       return true;
    }

    int32_t m_x;
    int32_t m_y;
};
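Point only has fixed-size members. For a class that owns variable-length data, "taking the class's particular semantics into account" usually means writing a length prefix and validating it on the way back in. The following is a minimal sketch under that assumption; the class Label, its wire layout (4-byte id, 4-byte length, then the text bytes), and the WriteBE32()/ReadBE32() helpers are all hypothetical, and it uses shifts rather than htonl() so that it does not depend on any platform header.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>

class Label
{
public:
    Label() : m_id(0) {/* empty */}
    Label(uint32_t id, const std::string &text) : m_id(id), m_text(text) {/* empty */}

    // id (4 bytes) + text length (4 bytes) + the text bytes themselves
    std::size_t FlattenedSize() const {return 4 + 4 + m_text.size();}

    // (buffer) must point to at least FlattenedSize() bytes of writeable space
    void Flatten(uint8_t *buffer) const
    {
        assert(buffer != nullptr);
        WriteBE32(buffer, m_id);
        WriteBE32(buffer + 4, static_cast<uint32_t>(m_text.size()));
        if (!m_text.empty()) std::memcpy(buffer + 8, m_text.data(), m_text.size());
    }

    // Returns false (leaving this object unchanged) if the bytes are truncated
    bool Unflatten(const uint8_t *buffer, std::size_t size)
    {
        if (buffer == nullptr || size < 8) return false;  // header is missing
        const uint32_t id  = ReadBE32(buffer);
        const uint32_t len = ReadBE32(buffer + 4);
        if (len > size - 8) return false;  // claimed length exceeds received bytes

        m_id = id;
        m_text.assign(reinterpret_cast<const char *>(buffer + 8), len);
        return true;
    }

    uint32_t    m_id;
    std::string m_text;

private:
    // Big-endian (network order) encoding that is independent of the host CPU
    static void WriteBE32(uint8_t *out, uint32_t v)
    {
        out[0] = static_cast<uint8_t>(v >> 24);
        out[1] = static_cast<uint8_t>(v >> 16);
        out[2] = static_cast<uint8_t>(v >> 8);
        out[3] = static_cast<uint8_t>(v);
    }

    static uint32_t ReadBE32(const uint8_t *in)
    {
        return (static_cast<uint32_t>(in[0]) << 24) |
               (static_cast<uint32_t>(in[1]) << 16) |
               (static_cast<uint32_t>(in[2]) << 8)  |
                static_cast<uint32_t>(in[3]);
    }
};
```

The length check in Unflatten() matters because the bytes come from a socket: a truncated or corrupted message should be rejected rather than read past the end of the buffer.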

... then your unmarshal function could look like this (note I've made it templated so that it will work for any class that implements the above methods):

/*! Takes a byte array and populates the object's data members */
template<class T> std::shared_ptr<T> unmarshal(const uint8_t *serialized, size_t size)
{
    auto sp = std::make_shared<T>();
    if (sp->Unflatten(serialized, size) == true) return sp;
 
    // Oops, Unflatten() failed!  handle the error somehow here
    [...]
}
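How to handle a failed Unflatten() is left open above. One possible choice, shown here purely as a sketch, is to return an empty shared_ptr and let the caller test for it. The names unmarshal_or_null and Counter are hypothetical; Counter is only a stand-in for any class that provides Unflatten().

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>

// Hypothetical minimal message type that provides the Unflatten() interface.
struct Counter
{
    uint32_t value = 0;

    bool Unflatten(const uint8_t *buffer, std::size_t size)
    {
        if (buffer == nullptr || size < 4) return false;
        // Decode a big-endian 32-bit value
        value = (static_cast<uint32_t>(buffer[0]) << 24) |
                (static_cast<uint32_t>(buffer[1]) << 16) |
                (static_cast<uint32_t>(buffer[2]) << 8)  |
                 static_cast<uint32_t>(buffer[3]);
        return true;
    }
};

// Returns a populated object, or an empty shared_ptr if the bytes were rejected.
template<class T> std::shared_ptr<T> unmarshal_or_null(const uint8_t *serialized, std::size_t size)
{
    auto sp = std::make_shared<T>();
    if (!sp->Unflatten(serialized, size)) return nullptr;
    return sp;
}
```

Because the object is created by std::make_shared, the shared_ptr owns memory that was really allocated for a T, so its destructor has a valid object to destroy. That is not the case when a shared_ptr is handed a pointer into a receive buffer.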

If this seems like a lot of work compared to just grabbing the raw memory bytes of your class object and sending them verbatim across the wire, you're right: it is. But this is what you have to do if you want the serialization to work reliably and not break every time you upgrade your compiler, change your optimization flags, or want to communicate between computers with different CPU architectures. If you'd rather not do this sort of thing by hand, there are pre-packaged libraries that (partially) automate the process, such as Google's Protocol Buffers library, or even good old XML.
